This week, we will explore the role of vision in sound and music perception and cognition. Vision is not only essential for navigating our environment but also plays a significant part in how we experience and interpret sound and music. Visual cues can influence how we perceive auditory information, shape our expectations, and enhance our understanding of complex musical performances. For example, watching a musician’s sound-producing actions or a conductor’s gestures can affect how we interpret rhythm, timing, and emotional expression in music. We will discuss how the brain integrates visual and auditory information, examine phenomena such as audiovisual illusions, and consider the importance of visual feedback in musical learning and performance. Through examples and experiments, we will see how vision and hearing work together to create richer, more immersive experiences in both everyday life and artistic contexts.
Audiovisuality¶
Before delving into the anatomy of the eye or the use of eye tracking in music performance and perception research, it’s essential to clarify how vision and audition interact, especially in the context of music psychology and technology.
Auditory-Visual vs. Audio-Video¶
In this course, it’s important to distinguish between the terms auditory/visual and audio/video, as they refer to different domains:
Audio/Video: These terms relate to the technological capture, processing, and reproduction of sound and moving images. Audio refers to the recording, transmission, and playback of sound signals, whether analog or digital. Video involves the recording, processing, and display of moving images.
Auditory/Visual: These terms denote the biological sensory modalities and their perceptual processes. The auditory system transduces air-pressure fluctuations (sound) into neural signals that encode pitch, loudness, timbre and spatial position. The visual system transduces visible electromagnetic radiation (light) into neural representations of luminance, color, form, motion and depth. Perception arises from both the physical properties of stimuli (frequency/wavelength, intensity) and neural processing (adaptation, attention, context), and underlies how we integrate cues across senses.
Understanding the physical phenomena (sound and light), how they are perceived (through the auditory and visual modalities), and their technical representations (audio and video) is crucial for grasping how audiovisual systems function and how humans perceive and interact with their environment, particularly in music, where the integration of auditory and visual information shapes our experience.
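To make the physical side concrete, the same relation, wavelength = speed / frequency, applies to both sound and light, only with very different propagation speeds. The values below (speed of sound in air, a 440 Hz tone, a rough frequency for green light) are illustrative assumptions:

```python
# Wavelength = propagation speed / frequency, for sound and for light.
SPEED_OF_SOUND_AIR = 343.0   # m/s in air at ~20 °C (assumed)
SPEED_OF_LIGHT = 3.0e8       # m/s in vacuum (rounded)

tone_hz = 440.0              # concert A
light_hz = 5.4e14            # roughly green light

print(f"440 Hz tone:  wavelength ≈ {SPEED_OF_SOUND_AIR / tone_hz:.2f} m")
print(f"Green light:  wavelength ≈ {SPEED_OF_LIGHT / light_hz * 1e9:.0f} nm")
```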
Integration of Senses¶
Auditory–visual integration describes the brain’s ability to combine sound and sight to enhance perception and comprehension; this process is essential for tasks like speech perception and for richer multimedia experiences.
Multimodal perception refers to the integration of information from multiple sensory modalities (e.g., sight, sound, touch) to form a unified understanding of the world.
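One common way to formalize this kind of integration, offered here only as an illustrative sketch rather than something developed in this course, is reliability-weighted averaging: each modality's estimate is weighted by the inverse of its variance, and the fused estimate is more reliable than either cue alone.

```python
def fuse_cues(mu_a, var_a, mu_v, var_v):
    """Reliability-weighted (maximum-likelihood) combination of two cues.

    Each cue contributes in proportion to its reliability (1/variance);
    the fused variance is always smaller than either single-cue variance.
    """
    w_a = (1 / var_a) / (1 / var_a + 1 / var_v)
    w_v = 1 - w_a
    mu_fused = w_a * mu_a + w_v * mu_v
    var_fused = 1 / (1 / var_a + 1 / var_v)
    return mu_fused, var_fused

# Hypothetical example: locating a sound source in azimuth (degrees).
# Vision is more precise (smaller variance), so it dominates the estimate.
print(fuse_cues(mu_a=10.0, var_a=16.0, mu_v=4.0, var_v=4.0))
```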
Multimodal perception can give rise to striking crossmodal effects. One famous example of an auditory-visual illusion is the McGurk effect. Check this video:
When the visual mouth movements of one speech sound are paired with the acoustic signal of another, observers often perceive a third, fused sound (classically: audio “ba” + visual “ga” → perceived “da”). It illustrates that vision can alter basic perceptual interpretation of sound, not just higher‑level judgments. The effect arises because the brain combines temporally coincident but conflicting cues across modalities within a limited temporal window; cortical areas such as the superior temporal sulcus are implicated in resolving the mismatch. Factors that modulate the effect include timing (asynchrony reduces fusion), signal clarity (degraded audio strengthens visual influence), attention, and individual differences (e.g., lip‑reading skill).
In music contexts the same principles explain why seeing a performer’s gestures, facial movements, or instrument actions can change perceived onset, articulation, or expressiveness of sound.
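A toy way to picture the limited temporal window mentioned above is to let the probability of audiovisual fusion fall off as the audio and video onsets drift apart. The Gaussian shape and the roughly 100 ms width below are illustrative assumptions, not fitted values:

```python
import math

def fusion_probability(asynchrony_ms, window_ms=100.0, p_max=0.9):
    """Toy model: likelihood of audiovisual fusion vs. asynchrony.

    Fusion is most likely when audio and video are (near-)synchronous and
    falls off as the onsets drift apart; window_ms sets how quickly.
    """
    return p_max * math.exp(-(asynchrony_ms ** 2) / (2 * window_ms ** 2))

for dt in (0, 50, 100, 200, 400):
    print(f"{dt:>3} ms offset -> fusion probability ≈ {fusion_probability(dt):.2f}")
```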
The Eye¶
This section gives a clear, practical overview of the eye and why it matters for studies of sound, music, and performance.
Basic anatomy and optics¶
The human eye is a highly evolved optical organ that focuses light onto the retina, its photosensitive inner surface, enabling us to see with remarkable clarity. The eye adapts dynamically to changing light conditions: in bright environments the pupil constricts to sharpen the image and reduce optical distortions; in dim light it dilates to maximize the number of photons reaching the retinal photoreceptors.
Humans, as diurnal animals, possess eyes with high optical resolution—surpassed only by a few species, such as birds of prey. During evolution, the human cornea (the transparent front surface of the eye) became the primary structure for image formation, while the lens, located just behind the pupil, fine-tunes the focus so that images are sharply projected onto the retina.
The iris, a ring of pigmented muscle tissue, controls the size of the pupil. Its main function is not simply to regulate brightness, but to ensure optimal visual resolution under varying lighting conditions. The pupil’s diameter ranges from about 2 to 8 mm, yet natural light levels can vary by a factor of a million (e.g., from moonlight to sunlight).
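It is worth making this mismatch concrete: widening the pupil from 2 mm to 8 mm increases its light-gathering area only about 16-fold, far less than the million-fold range of natural illumination, so most light adaptation has to happen in the retina itself. A quick back-of-the-envelope check (the lux values are rough illustrative figures):

```python
import math

def pupil_area(diameter_mm):
    """Area of a circular pupil in mm^2."""
    return math.pi * (diameter_mm / 2) ** 2

area_ratio = pupil_area(8) / pupil_area(2)   # ≈ 16x more light admitted
light_ratio = 100_000 / 0.1                  # rough lux: sunlight vs. moonlight

print(f"Pupil area ratio (8 mm vs 2 mm): {area_ratio:.0f}x")
print(f"Moonlight-to-sunlight range:     {light_ratio:.0e}x")
```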

The retina¶
The retina turns light into neural signals and performs early processing such as enhancing contrast, finding edges, and detecting motion. It contains three main classes of light-sensitive cells: rods (very sensitive, support vision in dim light and in shades of gray), cones (work in bright light, provide color vision, and are densely packed in the fovea), and ipRGCs (intrinsically photosensitive retinal ganglion cells that help control pupil size and circadian rhythms).
Eye movements and attention¶
Eyes move in a few basic ways: saccades are very fast jumps that bring the sharp center of vision (the fovea) onto a new spot, with the intervening fixations lasting around 200–400 ms; smooth pursuit lets the eyes track a visible moving object smoothly; and vergence changes the angle between the two eyes so that near and far objects can be brought into focus. In music settings, where someone looks (for example at a face, the hands, or the score) strongly determines what visual information they pick up, and that visual input can change how they hear, understand, or perform the music.
Pupil control¶
Pupil size is set by the iris muscles and controlled by the autonomic nervous system; it balances how much light enters the eye with image sharpness (a small pupil gives more depth of field and a sharper image, a large pupil lets in more light). Pupil changes also reflect non‑visual states like mental effort, surprise, or emotional arousal — linked to brain arousal systems (for example the locus coeruleus and noradrenaline) — and these changes can be measured with pupillometry. When using or interpreting pupil measurements, remember that lighting, task difficulty, and emotional context all affect pupil size.
Eye Tracking and Pupillometry¶
Eye Tracking¶
Eye tracking is a technique used to measure where and how the eyes move, providing insights into visual attention, perception, and cognitive processes. Eye trackers are widely used in psychology, neuroscience, marketing, usability studies, and human-computer interaction.
There are two main types of eye trackers: mobile and stationary. Mobile eye trackers are wearable devices (such as glasses) that allow for eye movement recording in natural, real-world environments. These are useful for studies involving movement, such as sports, navigation, or field research.

Figure 1: Mobile eye-tracking glasses in use during MusicLab Abels KORK. Image credit: Simen Kjellin/UiO.
Stationary eye trackers are fixed devices, often mounted on a desk or integrated into a monitor, used in controlled laboratory settings. These are ideal for experiments on reading, visual search, or website usability.

Figure 2: Example of stationary eye-tracking software.
Eye tracking data can reveal patterns of gaze, fixations, and saccades, helping researchers understand how people process visual information and allocate attention.
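As a sketch of how such events are derived from raw recordings, the snippet below applies a simple velocity-threshold rule (in the spirit of the I-VT algorithm) to hypothetical gaze samples: samples whose point-to-point velocity exceeds a threshold are labelled saccades, the rest fixations. The sampling rate, threshold, and data format are assumptions; real analysis toolkits use more robust criteria.

```python
import numpy as np

def classify_ivt(x_deg, y_deg, sample_rate_hz=250.0, velocity_threshold=30.0):
    """Label each gaze sample 'fixation' or 'saccade' (simple I-VT-style rule).

    x_deg, y_deg: gaze position in degrees of visual angle.
    velocity_threshold: deg/s above which a sample counts as a saccade.
    """
    dt = 1.0 / sample_rate_hz
    vx = np.gradient(np.asarray(x_deg), dt)
    vy = np.gradient(np.asarray(y_deg), dt)
    speed = np.hypot(vx, vy)                  # angular velocity in deg/s
    return np.where(speed > velocity_threshold, "saccade", "fixation")

# Hypothetical gaze trace: steady fixation, a quick jump, then a new fixation.
x = [0.0, 0.1, 0.0, 5.0, 10.0, 10.1, 10.0, 10.1]
y = [0.0] * 8
print(classify_ivt(x, y))
```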
Gaze tracking¶
Gaze denotes the direction of a person’s visual attention and is a primary observable proxy for where cognitive resources are allocated. In music research, gaze reveals what performers and listeners attend to (hands, face, score, conductor), how information is sampled over time, and how visual cues shape auditory perception.
Eye fixations of different individuals (colored circles), with the size of each circle indicating fixation duration. Image credit: Bruno Laeng/UiO.
Key gaze events and properties:
Fixations: brief periods (typically ~200–400 ms, variable by task) when the fovea is held on a location and detailed processing occurs. Fixation count and duration are common measures of interest or difficulty.
Saccades: rapid relocations of gaze (tens of ms) that reorient foveal vision; saccade amplitude and direction reveal scanning strategies but carry little perceptual detail.
Smooth pursuit: continuous tracking of a moving target; occurs only when a visible moving stimulus is followed.
Scanpaths and transitions: sequences of fixations and saccades that describe viewing strategies and can be summarized with transition matrices or string-based methods.
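A minimal sketch of the transition-matrix idea, assuming the fixation sequence has already been reduced to labelled areas of interest (the AOI names below are hypothetical):

```python
from collections import Counter

# Hypothetical fixation sequence, already mapped to areas of interest (AOIs).
fixation_aois = ["score", "hands", "hands", "score", "face", "score", "hands"]

# Count transitions between consecutive fixations on different AOIs.
transitions = Counter(
    (a, b) for a, b in zip(fixation_aois, fixation_aois[1:]) if a != b
)
for (source, target), count in sorted(transitions.items()):
    print(f"{source:>5} -> {target:<5} {count}")
```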
Gaze is a useful proxy for attention but not a perfect synonym; people can covertly attend without moving their eyes. Calibration, sampling rate, and tracker type (mobile vs. stationary) determine accuracy and which metrics are reliable (e.g., sampling above 250 Hz for precise saccade dynamics). Define areas of interest (AOIs) carefully, use dynamic AOIs for moving performers, and account for head motion in mobile settings. Finally, always interpret gaze patterns relative to task demands, participant expertise, and stimulus timing (for example, anticipatory fixations in sight-reading).
Pupillometry¶
Pupillometry is the measurement of pupil size and its changes over time. Because the pupil responds not only to light but also to cognitive and emotional states, pupillometry is a valuable tool for studying mental effort, arousal, and attention.

Checking the pupil size before MusicLab Copenhagen.
Daniel Kahneman, Nobel laureate (2002) and author of Thinking, Fast and Slow (2011), once remarked:
“The pupils reflect the extent of mental effort in an incredibly precise way [...] I have never done any work in which the measurement is so precise.”
Research over decades shows that pupil diameter reliably tracks mental workload. In a classic study, Eckhard Hess and James Polt (1964) reported systematic pupil dilation as participants solved progressively harder arithmetic problems; hundreds of subsequent studies have replicated and extended this finding, demonstrating that larger or phasic pupil responses often accompany increased cognitive and attentional demands.
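When analyzing such recordings, a common first step is to express pupil size relative to a pre-stimulus baseline so that slow drifts and individual differences in absolute pupil size matter less. The sketch below assumes a plain list of samples and a known sampling rate; real pipelines also handle blinks, interpolation, and luminance confounds.

```python
import numpy as np

def baseline_corrected_pupil(pupil_mm, sample_rate_hz, baseline_s=0.5):
    """Subtractive baseline correction of a pupil-diameter trace.

    The mean diameter over the first `baseline_s` seconds (pre-stimulus
    baseline) is subtracted from every sample, so the trace shows dilation
    relative to baseline rather than absolute size.
    """
    pupil = np.asarray(pupil_mm, dtype=float)
    n_baseline = max(1, int(baseline_s * sample_rate_hz))
    baseline = np.nanmean(pupil[:n_baseline])   # ignore missing samples (blinks)
    return pupil - baseline

# Hypothetical 60 Hz trace: flat baseline, then dilation during a hard task.
trace = [3.1] * 30 + [3.2, 3.4, 3.6, 3.7, 3.7, 3.6]
print(np.round(baseline_corrected_pupil(trace, sample_rate_hz=60), 2))
```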
Eye tracking and pupillometry in music research¶
Eye tracking is increasingly used in music research as hardware and software have become more accessible. Typical roles and use-cases include:
Concert performance analysis: Eye tracking is used to study how audience members watch performers, revealing which gestures or movements draw the most attention and how visual cues influence the perception of musical expressiveness.
Music reading and learning: Researchers use eye tracking to analyze how musicians read sheet music, identifying patterns in gaze, fixation, and saccades that relate to expertise and sight-reading ability.
Audiovisual illusions: Experiments combine sound and video (e.g., the McGurk effect) to demonstrate how conflicting visual and auditory cues can alter perception, showing the integration of senses.
Emotional response measurement: Pupillometry is employed to track changes in pupil size as listeners experience emotionally charged music, providing insights into arousal and engagement.
Performance anxiety studies: Both eye tracking and pupillometry are used to assess how musicians’ gaze patterns and pupil responses change under stress, helping to understand cognitive load and attentional focus during live performance.
Interactive installations: In music technology and art, eye tracking can be used to control sound parameters or trigger musical events based on where a participant looks, creating immersive audiovisual experiences.
These examples illustrate the practical applications of eye tracking and pupillometry in music research, education, and performance, highlighting how vision and audition interact in complex and meaningful ways.
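As a concrete illustration of the interactive-installation use case above, the sketch below maps a normalized horizontal gaze position onto a synthesis parameter (here, oscillator frequency). The gaze source, frequency range, and mapping are hypothetical; a real installation would read gaze from an eye-tracker SDK and send the value to a synthesizer, for example via OSC or MIDI.

```python
def gaze_to_frequency(gaze_x, f_min=220.0, f_max=880.0):
    """Map a normalized horizontal gaze position (0.0-1.0) to a frequency in Hz.

    Uses an exponential mapping so equal gaze shifts correspond to equal
    musical intervals rather than equal steps in Hz.
    """
    gaze_x = min(max(gaze_x, 0.0), 1.0)   # clamp to the screen edges
    return f_min * (f_max / f_min) ** gaze_x

# Hypothetical gaze positions: left edge, centre, and right edge of the screen.
for gx in (0.0, 0.5, 1.0):
    print(f"gaze x = {gx:.1f} -> {gaze_to_frequency(gx):.1f} Hz")
```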
Questions for Review¶
How do visual cues influence our perception and interpretation of musical performances?
What are the main anatomical structures of the human eye involved in vision, and how do they contribute to image formation?
Explain the difference between audio/video and auditory/visual in the context of music perception and technology.
How can eye tracking and pupillometry be used to study music performance and perception?
What is the role of the autonomic nervous system in controlling pupil size, and how does this relate to attention and arousal?