
Week 9: Vision

How vision shapes our perception and experience of sound and music

This week, we will explore the role of vision in sound and music perception and cognition. Vision is not only essential for navigating our environment but also plays a significant part in how we experience and interpret sound and music. Visual cues can influence how we perceive auditory information, shape our expectations, and enhance our understanding of complex musical performances. For example, watching a musician’s sound-producing actions or a conductor’s gestures can affect how we interpret rhythm, timing, and emotional expression in music. We will discuss how the brain integrates visual and auditory information, examine phenomena such as audiovisual illusions, and consider the importance of visual feedback in musical learning and performance. Through examples and experiments, we will see how vision and hearing work together to create richer, more immersive experiences in both everyday life and artistic contexts.

Audiovisuality

Before delving into the anatomy of the eye or the use of eye tracking in music performance and perception research, it’s essential to clarify how vision and audition interact, especially in the context of music psychology and technology.

Auditory-Visual vs. Audio-Video

In this course, it’s important to distinguish between the terms auditory/visual and audio/video, as they refer to different domains:

Auditory and visual refer to the sensory modalities, that is, how we perceive sound and light through hearing and sight.

Audio and video refer to the technical representations of sound and light: the signals, recordings, and media formats used in technology.

Understanding the physical phenomena (sound and light), how they are perceived (through the auditory and visual modalities), and their technical representations (audio and video) is crucial for grasping how audiovisual systems function and how humans perceive and interact with their environment, particularly in music, where the integration of auditory and visual information shapes our experience.

Integration of Senses

Auditory–visual integration describes the brain’s ability to combine sound and sight to enhance perception and comprehension; this process is essential for tasks like speech perception and for richer multimedia experiences.

Multimodal perception refers to the integration of information from multiple sensory modalities (e.g., sight, sound, touch) to form a unified understanding of the world.

Some interesting crossmodal effects can arise from such multimodal perception. One famous example of an auditory-visual illusion is the McGurk effect. Check this video:

When the visual mouth movements of one speech sound are paired with the acoustic signal of another, observers often perceive a third, fused sound (classically: audio “ba” + visual “ga” → perceived “da”). It illustrates that vision can alter basic perceptual interpretation of sound, not just higher‑level judgments. The effect arises because the brain combines temporally coincident but conflicting cues across modalities within a limited temporal window; cortical areas such as the superior temporal sulcus are implicated in resolving the mismatch. Factors that modulate the effect include timing (asynchrony reduces fusion), signal clarity (degraded audio strengthens visual influence), attention, and individual differences (e.g., lip‑reading skill).
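One standard way to formalize this kind of cue combination is reliability-weighted (inverse-variance) averaging, in which each modality contributes in proportion to how reliable its signal is; this helps explain why degraded audio strengthens visual influence. The sketch below is a minimal illustration of that general model, not a description of the McGurk experiments themselves, and the numbers and variable names are purely illustrative.

```python
def fuse_estimates(audio_est, audio_var, visual_est, visual_var):
    """Reliability-weighted (inverse-variance) fusion of two cues.

    Each cue is weighted by its reliability (1/variance), so a noisy
    (high-variance) auditory signal pulls the fused percept toward
    the visual estimate.
    """
    w_a = 1.0 / audio_var
    w_v = 1.0 / visual_var
    fused = (w_a * audio_est + w_v * visual_est) / (w_a + w_v)
    fused_var = 1.0 / (w_a + w_v)  # fused estimate is more reliable than either cue alone
    return fused, fused_var

# Illustrative example: the two cues placed on an arbitrary 1-D feature axis,
# 0.0 for the auditory cue and 1.0 for the conflicting visual cue.
clear_audio = fuse_estimates(0.0, audio_var=0.1, visual_est=1.0, visual_var=0.4)
noisy_audio = fuse_estimates(0.0, audio_var=0.8, visual_est=1.0, visual_var=0.4)
print(clear_audio)  # fused value stays close to the auditory cue
print(noisy_audio)  # fused value shifts toward the visual cue
```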

In music contexts the same principles explain why seeing a performer’s gestures, facial movements, or instrument actions can change perceived onset, articulation, or expressiveness of sound.

The Eye

This section gives a clear, practical overview of the eye and why it matters for studies of sound, music, and performance.

Basic anatomy and optics

The human eye is a highly evolved optical organ that focuses light onto the retina, its photosensitive inner surface, enabling us to see with remarkable clarity. The eye adapts dynamically to changing light conditions: in bright environments the pupil constricts to sharpen the image and reduce optical distortions; in dim light it dilates to maximize the number of photons reaching the retinal photoreceptors.

Humans, as diurnal animals, possess eyes with high optical resolution—surpassed only by a few species, such as birds of prey. During evolution, the human cornea (the transparent front surface of the eye) became the primary structure for image formation, while the lens, located just behind the pupil, fine-tunes the focus so that images are sharply projected onto the retina.

The iris, a ring of pigmented muscle tissue, controls the size of the pupil. Its main function is not simply to regulate brightness, but to ensure optimal visual resolution under varying lighting conditions. The pupil’s diameter ranges from about 2 to 8 mm, yet natural light levels can vary by a factor of a million (e.g., from moonlight to sunlight).
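A quick back-of-the-envelope calculation (illustrative numbers only) makes this point concrete: going from a 2 mm to an 8 mm pupil changes the light-gathering area by only a factor of about 16, far short of the roughly million-fold variation in natural illumination, so most light adaptation must happen in the retina rather than at the pupil.

```python
import math

def pupil_area(diameter_mm: float) -> float:
    """Area of a circular pupil in mm^2."""
    return math.pi * (diameter_mm / 2) ** 2

area_min = pupil_area(2.0)  # fully constricted pupil
area_max = pupil_area(8.0)  # fully dilated pupil

print(f"Area ratio (8 mm vs 2 mm): {area_max / area_min:.0f}x")  # ~16x
# The ~16x change in pupil area covers only a tiny fraction of the roughly
# 10^6 range of natural light levels (moonlight to sunlight); photoreceptor
# and neural adaptation in the retina handle the rest.
```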

Anatomy of the Eye.

The retina

The retina turns light into neural signals and does early processing like enhancing contrast, finding edges, and detecting motion. It has three main light sensors: rods (very sensitive, work in dim light and see in shades of gray), cones (work in bright light, provide color vision and are packed in the fovea), and ipRGCs (intrinsically photosensitive retinal ganglion cells that help control pupil size and body rhythms).

Eye movements and attention

Eyes move in a few basic ways: saccades are very fast jumps that put the sharp center of vision (the fovea) on a new spot, with fixations lasting around 200–400 ms; smooth pursuit lets the eyes follow a moving object smoothly when you can see it; and vergence changes the angle of the two eyes so you can focus on near versus far things. In music settings, where someone looks (for example at a face, the hands, or the score) strongly determines what visual information they pick up and that visual input can change how they hear, understand, or perform the music.

Pupil control

Pupil size is set by the iris muscles and controlled by the autonomic nervous system; it balances how much light enters the eye with image sharpness (a small pupil gives more depth of field and a sharper image, a large pupil lets in more light). Pupil changes also reflect non‑visual states like mental effort, surprise, or emotional arousal — linked to brain arousal systems (for example the locus coeruleus and noradrenaline) — and these changes can be measured with pupillometry. When using or interpreting pupil measurements, remember that lighting, task difficulty, and emotional context all affect pupil size.

Eye Tracking and Pupillometry

Eye Tracking

Eye tracking is a technique used to measure where and how the eyes move, providing insights into visual attention, perception, and cognitive processes. Eye trackers are widely used in psychology, neuroscience, marketing, usability studies, and human-computer interaction.

There are two main types of eye trackers: mobile and stationary. Mobile eye trackers are wearable devices (such as glasses) that allow for eye movement recording in natural, real-world environments. These are useful for studies involving movement, such as sports, navigation, or field research.

Mobile Eye-Tracker
Figure 1: Mobile eye-tracking glasses in use during MusicLab Abels KORK. Image credit: Simen Kjellin/UiO.

Stationary eye trackers are fixed devices, often mounted on a desk or integrated into a monitor, used in controlled laboratory settings. These are ideal for experiments on reading, visual search, or website usability.

Stationary Eye-Tracker
Figure 2: Example of stationary eye-tracking software.

Eye tracking data can reveal patterns of gaze, fixations, and saccades, helping researchers understand how people process visual information and allocate attention.
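As a concrete illustration, a common first step in analyzing such data is to classify raw gaze samples into saccades and fixations using a simple velocity threshold (the I-VT approach). The sketch below assumes gaze positions already converted to degrees of visual angle and a known sampling rate; the threshold value and toy data are illustrative assumptions, not settings recommended for any particular tracker.

```python
import numpy as np

def classify_ivt(x_deg, y_deg, sample_rate_hz, velocity_threshold=30.0):
    """Velocity-threshold (I-VT) classification of gaze samples.

    x_deg, y_deg: gaze position in degrees of visual angle, one value per sample.
    Returns a boolean array: True = saccade sample, False = fixation sample.
    """
    x = np.asarray(x_deg, dtype=float)
    y = np.asarray(y_deg, dtype=float)

    # Angular velocity between consecutive samples (deg/s).
    velocity = np.hypot(np.diff(x), np.diff(y)) * sample_rate_hz

    # Pad so the output has one label per sample.
    velocity = np.concatenate([[0.0], velocity])
    return velocity > velocity_threshold

# Toy example: 1 s of 250 Hz data with one fast jump halfway through.
t = np.arange(250)
x = np.where(t < 125, 0.0, 10.0) + np.random.normal(0, 0.05, 250)
y = np.random.normal(0, 0.05, 250)
is_saccade = classify_ivt(x, y, sample_rate_hz=250)
print(f"{is_saccade.sum()} samples classified as saccadic")
```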

Gaze tracking

Gaze denotes the direction of a person’s visual attention and is a primary observable proxy for where cognitive resources are allocated. In music research, gaze reveals what performers and listeners attend to (hands, face, score, conductor), how information is sampled over time, and how visual cues shape auditory perception.

Gaze tracking

Eye fixations of different individuals (color circles) and for different durations (size of each circle). Image credit: Bruno Laeng/UiO.

Key gaze events and properties:

Fixations: periods (typically around 200–400 ms) when the eyes are relatively still and visual information is taken in.

Saccades: rapid jumps that relocate the fovea to a new point of interest.

Smooth pursuit: continuous tracking of a moving target, such as a conductor’s baton or a performer’s hands.

Dwell time: the total time spent looking within an area of interest (AOI), such as the score, an instrument, or a co-performer’s face.

Gaze is a useful proxy for attention, but not a perfect synonym: people can covertly attend to something without moving their eyes. Calibration quality, sampling rate, and tracker type (mobile vs. stationary) determine accuracy and which metrics are reliable; precise saccade dynamics, for instance, require sampling rates above roughly 250 Hz. AOIs should be defined carefully (dynamic AOIs for moving performers), and head motion must be accounted for in mobile settings. Finally, gaze patterns should always be interpreted relative to task demands, participant expertise, and stimulus timing (for example, anticipatory fixations in sight-reading).
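To make the AOI idea concrete, here is a minimal sketch of how dwell time per area of interest might be computed from fixation data. The AOI names, rectangle coordinates, and data format are hypothetical; real studies would often need dynamic AOIs for moving performers, as noted above.

```python
# Minimal sketch: total dwell time per rectangular AOI from fixation data.
# Fixations are (x, y, duration_ms); AOIs are named rectangles in screen pixels.
# All names, coordinates, and values below are hypothetical.

fixations = [
    (420, 310, 280),   # x, y (pixels), duration (ms)
    (450, 330, 350),
    (900, 620, 220),
]

aois = {
    "score":           (300, 200, 600, 450),    # x_min, y_min, x_max, y_max
    "performer_hands": (800, 500, 1100, 750),
}

def dwell_time_per_aoi(fixations, aois):
    """Sum fixation durations falling inside each AOI rectangle."""
    totals = {name: 0 for name in aois}
    for x, y, duration in fixations:
        for name, (x0, y0, x1, y1) in aois.items():
            if x0 <= x <= x1 and y0 <= y <= y1:
                totals[name] += duration
    return totals

print(dwell_time_per_aoi(fixations, aois))
# {'score': 630, 'performer_hands': 220}
```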

Pupillometry

Pupillometry is the measurement of pupil size and its changes over time. Because the pupil responds not only to light but also to cognitive and emotional states, pupillometry is a valuable tool for studying mental effort, arousal, and attention.

Pupillometry example

Checking the pupil size before MusicLab Copenhagen.

Daniel Kahneman, Nobel laureate (2002) and author of Thinking, Fast and Slow (2011), once remarked:

“The pupils reflect the extent of mental effort in an incredibly precise way [...] I have never done any work in which the measurement is so precise.”

Research over decades shows that pupil diameter reliably tracks mental workload. In a classic study, Eckhard Hess (1964) reported systematic pupil dilation as participants solved progressively harder arithmetic problems; hundreds of subsequent studies have replicated and extended this finding, demonstrating that larger or phasic pupil responses often accompany increased cognitive and attentional demands.
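In practice, pupillometry analyses usually compare pupil size during a task against a pre-stimulus baseline, since absolute pupil diameter is confounded by lighting and individual differences. The sketch below shows one common approach, subtractive baseline correction per trial; the sampling rate, baseline window, and data are illustrative assumptions rather than a prescription for any particular system.

```python
import numpy as np

def baseline_correct(pupil_trace, sample_rate_hz, baseline_s=0.5):
    """Subtractive baseline correction of a single-trial pupil trace.

    pupil_trace: pupil diameter (e.g. mm) sampled at sample_rate_hz,
                 starting baseline_s seconds before stimulus onset.
    Returns the trace expressed as change from the pre-stimulus baseline.
    """
    trace = np.asarray(pupil_trace, dtype=float)
    n_baseline = int(baseline_s * sample_rate_hz)
    baseline = np.nanmean(trace[:n_baseline])  # mean pupil size before onset
    return trace - baseline

# Toy trial: 0.5 s baseline at ~3 mm, then a slow dilation toward ~3.4 mm
# after an imaginary "hard task" onset.
fs = 60  # Hz
baseline_part = np.full(30, 3.0)
task_part = 3.0 + 0.4 * (1 - np.exp(-np.arange(90) / 30))
trial = np.concatenate([baseline_part, task_part])
corrected = baseline_correct(trial, sample_rate_hz=fs, baseline_s=0.5)
print(f"Peak dilation relative to baseline: {corrected.max():.2f} mm")
```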

Eye tracking and pupillometry in music research

Eye tracking is now more widely used in music research as hardware and software have become more accessible. Typical roles and use-cases include:

Tracking where performers look (at the score, their own hands, co-performers, or the conductor) during sight-reading and ensemble playing.

Studying how audiences allocate visual attention during live or recorded performances, as in the MusicLab events pictured above.

Using pupillometry to index mental effort, arousal, or engagement while people listen to or perform music.

Examining how visual feedback supports musical learning, teaching, and practice.

These examples illustrate the practical applications of eye tracking and pupillometry in music research, education, and performance, highlighting how vision and audition interact in complex and meaningful ways.

Questions for Review

  1. How do visual cues influence our perception and interpretation of musical performances?

  2. What are the main anatomical structures of the human eye involved in vision, and how do they contribute to image formation?

  3. Explain the difference between audio/video and auditory/visual in the context of music perception and technology.

  4. How can eye tracking and pupillometry be used to study music performance and perception?

  5. What is the role of the autonomic nervous system in controlling pupil size, and how does this relate to attention and arousal?