
Week 6: Harmony and melody

Exploring harmony, melody, and musical structure

University of Oslo

Frequency is a fundamental concept in music and sound, referring to the number of vibrations or cycles per second of a sound wave, measured in Hertz (Hz). It is the basis for understanding pitch, tone, and other auditory phenomena.

Tones

In music psychology, a tone is understood as a sound with a specific frequency and timbral quality that the auditory system interprets as having a definite pitch. Our perception of tones is shaped by both the physical properties of the sound wave (such as frequency, amplitude, and harmonic content) and the way our brains process these signals. Tones are the building blocks of musical perception, allowing us to distinguish melodies, harmonies, and timbres.

From a technological perspective, tones can be generated, analyzed, and manipulated using digital tools. Synthesizers create tones by combining waveforms, while audio analysis software can extract pitch and timbre features from recordings. Technologies such as pitch detection algorithms and spectral analysis are essential for applications in music information retrieval, automatic transcription, and digital instrument design.
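As a minimal sketch of these ideas, the snippet below synthesizes a tone by summing a fundamental and two harmonics (a crude form of additive synthesis) and then reads the strongest frequency off the magnitude spectrum; the chosen frequencies and amplitudes are arbitrary illustration values.

import numpy as np

sr = 22050                           # sample rate in Hz
t = np.linspace(0, 1.0, sr, endpoint=False)

# Additive synthesis: 220 Hz fundamental plus two weaker harmonics
f0 = 220.0
tone = (1.0 * np.sin(2 * np.pi * f0 * t)
        + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)
        + 0.25 * np.sin(2 * np.pi * 3 * f0 * t))

# Spectral analysis: the strongest peak of the magnitude spectrum is the fundamental here
spectrum = np.abs(np.fft.rfft(tone))
freqs = np.fft.rfftfreq(len(tone), d=1 / sr)
print(f"Strongest component: {freqs[np.argmax(spectrum)]:.1f} Hz")   # about 220 Hz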

Tones are not the same as notes: while a note refers to a symbolic representation in musical notation (with defined pitch, duration, and sometimes dynamics), a tone refers to the auditory experience itself.

Analysis-by-Synthesis

Analysis-by-Synthesis is a method used in sound and music research to understand auditory perception by recreating sounds and analyzing their properties. This approach is widely used in areas like speech synthesis, sound design, and music analysis.
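A toy version of the approach, assuming the sound under study is a single steady tone: analyze the signal to estimate its frequency and amplitude, resynthesize a sine wave from those parameters, and compare the result to the original. The test tone and error measure are arbitrary choices for illustration.

import numpy as np

sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
original = 0.3 * np.sin(2 * np.pi * 330.0 * t)       # the "unknown" sound

# Analysis: estimate frequency (spectral peak) and amplitude (peak level)
spectrum = np.abs(np.fft.rfft(original))
freqs = np.fft.rfftfreq(len(original), d=1 / sr)
f_est = freqs[np.argmax(spectrum)]
a_est = np.max(np.abs(original))

# Synthesis: rebuild the sound from the estimated parameters
resynthesized = a_est * np.sin(2 * np.pi * f_est * t)

# A small residual suggests the analysis captured the essential parameters
error = np.mean((original - resynthesized) ** 2)
print(f"f = {f_est:.1f} Hz, amplitude = {a_est:.2f}, mean squared error = {error:.1e}")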

Melody

Melody is a sequence of musical notes perceived as a single entity. It is often the most recognizable and memorable aspect of a musical piece and plays a crucial role in emotional engagement and memory recall. Tools like MIDI editors and pitch detection algorithms are used to analyze and manipulate melodies.

From a psychological perspective, melody is the perception of a coherent sequence of tones that form a recognizable musical line. Melodies are central to musical memory and emotional response, as the brain tracks pitch contours, intervals, and rhythmic patterns to identify and recall tunes. Research in music cognition explores how listeners segment, remember, and anticipate melodic sequences.

Technological advances have enabled detailed analysis and manipulation of melody. Pitch tracking algorithms extract melodic lines from audio, while MIDI editors and music notation software allow for precise editing and visualization. In music generation and AI composition, models learn melodic patterns from large datasets to create new, stylistically consistent melodies. Melody extraction and similarity algorithms are also used in music search and recommendation systems.
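As a small, hedged example of pitch tracking, the sketch below runs librosa's pyin estimator over a synthetic two-note melody and converts the estimated frequencies to MIDI note numbers to obtain a melodic contour; the test signal and parameter choices are illustrative only.

import numpy as np
import librosa

sr = 22050
t = np.linspace(0, 0.5, sr // 2, endpoint=False)
# Two-note test melody: 0.5 s of A4 (440 Hz) followed by 0.5 s of C5 (523.25 Hz)
melody_audio = np.concatenate([
    0.5 * np.sin(2 * np.pi * 440.0 * t),
    0.5 * np.sin(2 * np.pi * 523.25 * t),
])

# Track the fundamental frequency frame by frame with pYIN
f0, voiced, _ = librosa.pyin(melody_audio, fmin=librosa.note_to_hz('C2'),
                             fmax=librosa.note_to_hz('C7'), sr=sr)

# Keep only voiced frames and convert Hz to MIDI note numbers
contour = librosa.hz_to_midi(f0[voiced])
print(np.round(contour[:5]), np.round(contour[-5:]))   # roughly 69 ... 72 (A4 ... C5)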

Auditory stream segregation

Auditory stream segregation is the process by which the human auditory system organizes complex mixtures of sounds into perceptually meaningful elements, or “streams.” In music, this allows listeners to distinguish between different melodic lines, instruments, or voices, even when they are played simultaneously. This perceptual organization is influenced by factors such as pitch, timbre, spatial location, and timing. For example, melodies that move in different pitch ranges or have distinct timbres are more likely to be perceived as separate streams. Understanding auditory stream segregation is essential for analyzing polyphonic music, designing effective music information retrieval systems, and developing algorithms for source separation and automatic transcription. Advances in computational modeling and machine learning have enabled researchers to simulate and study how the brain separates and tracks multiple musical streams in real-world listening scenarios.
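Perceptual streaming is not the same problem as signal-level source separation, but a simple decomposition gives a flavour of the latter. The hedged sketch below mixes a sustained tone with short noise bursts and uses librosa's harmonic-percussive separation to split the mixture back into two layers; the mixture itself is synthetic and chosen only for illustration.

import numpy as np
import librosa

sr = 22050
t = np.linspace(0, 2.0, 2 * sr, endpoint=False)

# A sustained tone (harmonic layer) mixed with short noise bursts (percussive layer)
tone = 0.4 * np.sin(2 * np.pi * 440.0 * t)
bursts = np.zeros_like(t)
rng = np.random.default_rng(0)
for onset in np.arange(0, 2.0, 0.25):
    i = int(onset * sr)
    bursts[i:i + 200] = 0.4 * rng.uniform(-1, 1, 200)
mix = tone + bursts

# Harmonic-percussive source separation on the mixture
harmonic, percussive = librosa.effects.hpss(mix)
print(harmonic.shape, percussive.shape)   # two signals, same length as the mixture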

Harmony

Harmony is the simultaneous sounding of different pitches, combining tones into intervals and chords. The interaction of pitches, shaped by timbre and texture, can evoke a wide range of emotional responses and plays a crucial role in the emotional tone and complexity of music.

Intervals

An interval is the distance between two pitches, measured in steps or frequency ratios. Intervals are the building blocks of harmony, as they define the relationships between notes played together or in succession. The human brain is sensitive to the relationships between pitches, perceiving certain combinations as consonant (pleasant or stable) and others as dissonant (tense or unstable). These perceptual responses are influenced by cultural exposure, musical training, and innate auditory processing mechanisms.

  • Types of Intervals: Intervals are named by counting the number of letter names from the lower to the higher note (e.g., C to E is a third). They can be major, minor, perfect, augmented, or diminished.
  • Consonance and Dissonance: Some intervals, like octaves and perfect fifths, are perceived as consonant, while others, like minor seconds or tritones, are more dissonant.
  • Role in Harmony: Intervals form the basis for chords and harmonic progressions. The combination of intervals within a chord determines its character and function.
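In twelve-tone equal temperament, an interval of n semitones corresponds to a frequency ratio of 2^(n/12). The short sketch below computes a few common intervals from that formula; the reference pitch A4 = 440 Hz is a standard but arbitrary choice.

# Frequency ratios of common intervals in 12-tone equal temperament: 2 ** (semitones / 12)
intervals = {
    'minor second': 1,
    'major third': 4,
    'perfect fourth': 5,
    'tritone': 6,
    'perfect fifth': 7,
    'octave': 12,
}

a4 = 440.0   # reference pitch A4 in Hz
for name, semitones in intervals.items():
    ratio = 2 ** (semitones / 12)
    print(f"{name:14s} ratio = {ratio:.3f}   (A4 -> {a4 * ratio:.1f} Hz)")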

Chords

A chord is a group of three or more notes played simultaneously. Chords are the foundation of Western harmony and are used to create progressions that define the structure and mood of a piece.

  • Triads: The most basic chords, consisting of three notes (root, third, fifth). Types include major, minor, diminished, and augmented triads.
  • Seventh Chords and Extensions: Adding more notes (such as sevenths, ninths, elevenths, and thirteenths) creates richer harmonies.
  • Chord Progressions: Sequences of chords that create movement and tension-resolution patterns in music (e.g., I–IV–V–I in classical music, or ii–V–I in jazz).
  • Functional Harmony: Chords have roles (tonic, dominant, subdominant) that guide the listener’s expectations.
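Since music21 is used later in these notes, here is a small, hedged sketch that builds a few triads, labels them, and strings together a I–IV–V–I progression in C major; the note spellings and rhythm are chosen only for illustration.

import music21

# Build and label a few triads
for pitches in (['C4', 'E4', 'G4'], ['A3', 'C4', 'E4'], ['B3', 'D4', 'F4']):
    print(pitches, '->', music21.chord.Chord(pitches).commonName)   # major, minor, diminished triad

# A I–IV–V–I progression in C major as a short stream
progression = music21.stream.Stream()
for pitches in (['C4', 'E4', 'G4'], ['F4', 'A4', 'C5'], ['G4', 'B4', 'D5'], ['C4', 'E4', 'G4']):
    ch = music21.chord.Chord(pitches)
    ch.quarterLength = 1.0
    progression.append(ch)
# progression.show()   # renders as notation if a notation program is configured (see below)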

Timbre

Recall that timbre, often called “sound color,” is the quality of a sound that distinguishes different instruments or voices, even when they produce the same pitch and loudness.

  • “Sound Color”: The unique quality that makes a violin sound different from a flute, even if both play the same note at the same loudness.
  • Spectral Content: Timbre is shaped by the harmonic content (overtones) and the way energy is distributed across frequencies.
  • Temporal Envelope: How a sound unfolds over time (attack, decay, sustain, and release) contributes to timbre perception.
  • Just Noticeable Differences (JND): The smallest change in a sound property (such as frequency, amplitude, or spectral content) that can be perceived.
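As a rough, hedged illustration of how spectral content shapes timbre, the sketch below compares the spectral centroid (a common correlate of perceived "brightness") of two synthetic tones that share the same pitch but weight their harmonics differently.

import numpy as np
import librosa

sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
f0 = 220.0

# Same pitch, different harmonic weighting: quickly decaying vs. slowly decaying overtones
dull = sum((1 / k ** 2) * np.sin(2 * np.pi * k * f0 * t) for k in range(1, 9))
bright = sum((1 / k) * np.sin(2 * np.pi * k * f0 * t) for k in range(1, 9))

for name, tone in [('dull', dull), ('bright', bright)]:
    centroid = librosa.feature.spectral_centroid(y=tone, sr=sr)
    print(f"{name:6s} spectral centroid = {centroid.mean():.0f} Hz")   # higher for the brighter tone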

Texture

Texture describes how multiple layers of sound interact in a musical composition. It ranges from monophonic (a single melody) to polyphonic (multiple independent melodies).

  • Monophonic Texture: A single melodic line without accompaniment (e.g., solo singing).
  • Homophonic Texture: A main melody supported by chords or accompaniment (e.g., singer with guitar).
  • Polyphonic Texture: Two or more independent melodies played simultaneously (e.g., fugues, counterpoint).
  • Heterophonic Texture: Multiple performers play variations of the same melody at the same time.
  • Combination of Timbres: The blending of different sound qualities to create a rich texture.

Audio Visualizations

We have previously looked at waveform displays and spectrograms. However, there are also several other visualization forms that try to better capture what humans hear.

Visualizing audio is crucial for understanding both the physical properties of sound and how humans perceive it. Different representations highlight various aspects of the audio signal, making them useful for analysis, classification, and creative applications.

Waveform

A waveform is a simple plot of amplitude versus time. It shows how the air pressure (or voltage, in digital audio) changes over time. While useful for seeing the overall shape and dynamics of a sound, it does not provide detailed information about frequency content.

Spectrogram

A spectrogram displays how the frequency content of a signal changes over time. It is created by applying the Short-Time Fourier Transform (STFT) to the audio, resulting in a 2D image where the x-axis is time, the y-axis is frequency, and the color represents amplitude (often in decibels). Spectrograms are widely used for audio analysis, speech recognition, and music research.

Log Mel Spectrogram

The Log Mel Spectrogram mimics human hearing by applying the STFT, mapping the frequencies to the Mel scale (which is more perceptually relevant), and then applying a logarithmic transformation to represent the amplitude on a decibel scale. This representation compresses the frequency axis to better match how humans perceive pitch differences, making it especially useful in machine learning and audio classification tasks.

MFCCs (Mel-Frequency Cepstral Coefficients)

MFCCs are a compact representation of the spectral envelope of a sound. They are computed by taking the Log Mel Spectrogram and applying the Discrete Cosine Transform (DCT), which decorrelates the features and compresses the information. MFCCs are widely used in speech recognition, music classification, and audio similarity tasks because they capture timbral characteristics that are important for distinguishing different sounds.

CQT (Constant-Q Transform)

The Constant-Q Transform uses a logarithmic frequency scale, with exponentially spaced center frequencies and varying filter bandwidths. This makes it ideal for musical applications, as it aligns with the way musical notes are spaced (e.g., each octave is divided into equal steps). The CQT is particularly useful for tasks like pitch tracking, chord recognition, and music transcription, as it provides a more musically meaningful frequency representation than the linear STFT.

Summary Table

Visualization        | What it shows                  | Typical Use Cases
Waveform             | Amplitude vs. time             | Editing, dynamics, onset detection
Spectrogram          | Frequency vs. time (linear)    | Audio analysis, speech/music research
Log Mel Spectrogram  | Perceptual frequency vs. time  | Machine learning, audio classification
MFCCs                | Compressed spectral envelope   | Speech/music recognition, feature extraction
CQT                  | Musical pitch vs. time         | Pitch tracking, chord recognition, MIR

These visualizations are implemented in Python using libraries such as librosa and matplotlib, as shown in the code below. Each representation provides unique insights into the structure and content of audio signals, supporting both scientific analysis and creative exploration.

import numpy as np
import librosa

import matplotlib.pyplot as plt
import librosa.display

# Generate a test audio signal: a linear chirp sweeping from f_start to f_end
sr = 22050  # sample rate
duration = 2.0  # seconds
t = np.linspace(0, duration, int(sr * duration), endpoint=False)
f_start = 100  # Hz
f_end = 8000  # Hz, kept below the Nyquist frequency (sr / 2)
# Phase of a linear chirp: 2*pi*(f_start*t + (f_end - f_start)*t**2 / (2*duration))
audio = 0.5 * np.sin(2 * np.pi * (f_start * t + (f_end - f_start) * t**2 / (2 * duration)))

fig, axs = plt.subplots(3, 2, figsize=(14, 12))
fig.suptitle('Audio Representations', fontsize=16)

# Waveform
librosa.display.waveshow(audio, sr=sr, ax=axs[0, 0])
axs[0, 0].set_title('Waveform')
axs[0, 0].set_xlabel('')
axs[0, 0].set_ylabel('Amplitude')

# Spectrogram
S = np.abs(librosa.stft(audio, n_fft=1024, hop_length=256))
librosa.display.specshow(librosa.amplitude_to_db(S, ref=np.max), sr=sr, hop_length=256, x_axis='time', y_axis='hz', ax=axs[0, 1])
axs[0, 1].set_title('Spectrogram (dB)')

# Mel Spectrogram
S_mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=1024, hop_length=256, n_mels=64)
librosa.display.specshow(librosa.power_to_db(S_mel, ref=np.max), sr=sr, hop_length=256, x_axis='time', y_axis='mel', ax=axs[1, 0])
axs[1, 0].set_title('Mel Spectrogram (dB)')

# MFCC
mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, n_fft=1024, hop_length=256)
librosa.display.specshow(mfccs, x_axis='time', ax=axs[1, 1])
axs[1, 1].set_title('MFCC')

# CQT
C = np.abs(librosa.cqt(audio, sr=sr, hop_length=256, n_bins=60))
librosa.display.specshow(librosa.amplitude_to_db(C, ref=np.max), sr=sr, hop_length=256, x_axis='time', y_axis='cqt_note', ax=axs[2, 0])
axs[2, 0].set_title('CQT (dB)')

# Hide the last empty subplot
axs[2, 1].axis('off')

plt.tight_layout(rect=[0, 0, 1, 0.97])
plt.show()
[Figure: waveform, spectrogram, Mel spectrogram, MFCC, and CQT of the test signal]

Symbolic Representations

Symbolic representations are structured, human- and machine-readable formats that encode musical information such as pitch, rhythm, dynamics, and articulation. These representations are essential for music analysis, composition, generation, and interoperability between software tools. Below are some of the most widely used symbolic formats:

MIDI

MIDI (Musical Instrument Digital Interface) is a standard protocol for communicating musical performance data between electronic instruments and computers. It encodes information such as note pitch, velocity (how hard a note is played), duration, instrument type, and control changes (e.g., modulation, sustain pedal). MIDI files do not contain actual audio but rather instructions for how music should be played, making them compact and widely compatible. MIDI is the backbone of most digital music production environments and is used for sequencing, editing, and playback.

Key features:

  • Encodes note events (on/off), pitch, velocity, and timing.
  • Supports multiple channels (instruments) and tracks.
  • Widely supported by DAWs, synthesizers, and notation software.
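As a minimal sketch of this event-based structure, the snippet below writes a five-note scale as paired note-on/note-off messages with explicit pitch, velocity, and timing. It assumes the third-party mido library, which is not otherwise used in these notes.

import mido   # assumption: mido is installed (pip install mido)

mid = mido.MidiFile()            # default resolution: 480 ticks per quarter note
track = mido.MidiTrack()
mid.tracks.append(track)

# A five-note scale as paired note_on / note_off events (pitch, velocity, delta time in ticks)
for pitch in [60, 62, 64, 65, 67]:            # C4 D4 E4 F4 G4 as MIDI note numbers
    track.append(mido.Message('note_on', note=pitch, velocity=80, time=0))
    track.append(mido.Message('note_off', note=pitch, velocity=0, time=240))   # eighth note

mid.save('scale.mid')            # the file stores playback instructions, not audio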

ABC Notation

ABC Notation is a text-based music notation system that uses ASCII characters to represent musical scores. It is especially popular for folk and traditional music due to its simplicity and ease of sharing via plain text. ABC notation can encode melody, rhythm, lyrics, and basic chords.

Key features:

  • Human-readable and easy to edit in any text editor.
  • Supports simple melodies, chords, and lyrics.
  • Many online tools exist for converting ABC to sheet music or MIDI.
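The sketch below stores a tiny tune as an ABC string; in many setups music21 can parse such a string directly, though that call is included here only as a hedged suggestion and the tune itself is invented for illustration.

import music21

# A short tune in ABC notation, stored as an ordinary Python string
abc_tune = """X:1
T:Example tune
M:4/4
L:1/8
K:C
C D E F G F E D | C4 z4 |]"""

print(abc_tune)                                  # readable and editable as plain text

score = music21.converter.parse(abc_tune, format='abc')   # hedged: ABC parsing via music21
score.show('text')                               # text rendering, no notation software needed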

REMI

REMI (REvamped MIDI-derived events) is an enhanced representation of MIDI data designed for deep learning-based music generation. REMI introduces additional event types such as Note Duration, Bar, Position, and Tempo, allowing for a more structured and musically meaningful encoding of rhythm and meter.

Key features:

  • Designed for symbolic music generation with neural networks.
  • Encodes timing, structure, and expressive elements.
  • Facilitates learning of musical form and rhythm.
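The exact event vocabulary differs between REMI implementations, but one bar with two notes might be serialized roughly like the hypothetical token list below; the token names and value ranges are illustrative, not taken from a specific codebase.

# Hypothetical REMI-style token sequence for one bar containing two notes
remi_events = [
    'Bar',
    'Position_1/16', 'Tempo_120',
    'Pitch_60', 'Velocity_80', 'Duration_8',   # C4, eighth-note duration
    'Position_9/16',
    'Pitch_64', 'Velocity_80', 'Duration_8',   # E4
]

# Such sequences are usually mapped to integer IDs before being fed to a neural network
vocab = {token: i for i, token in enumerate(sorted(set(remi_events)))}
ids = [vocab[token] for token in remi_events]
print(ids)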

MusicXML

MusicXML is an XML-based format for representing Western music notation. It encodes detailed musical elements such as notes, rests, articulations, dynamics, lyrics, and layout information. MusicXML is ideal for sharing, analyzing, and archiving sheet music, and is supported by most notation software.

Key features:

  • Encodes full sheet music, including layout and expressive markings.
  • Supports complex scores with multiple staves, voices, and instruments.
  • Enables interoperability between notation programs (e.g., Finale, Sibelius, MuseScore).
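Since music21 is used below to build a melody, the same kind of stream can also be written out as MusicXML for use in other notation programs; a minimal sketch (file name chosen arbitrarily):

import music21

# Build a short melody and export it as MusicXML
melody = music21.stream.Stream()
for n in ['C4', 'E4', 'G4', 'C5']:
    melody.append(music21.note.Note(n, quarterLength=1.0))

melody.write('musicxml', fp='melody.musicxml')   # openable in MuseScore, Finale, Sibelius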

Piano Roll

The Piano Roll is a visual representation of music, where time is displayed on the horizontal axis and pitch on the vertical axis. Notes are shown as rectangles, with their position indicating onset, their length indicating duration, and their vertical placement indicating pitch. Piano rolls are commonly used in DAWs for editing MIDI data and visualizing performances.

Key features:

  • Intuitive, graphical interface for editing MIDI notes.
  • Useful for sequencing, quantization, and visualization.
  • Does not encode expressive markings or complex notation.
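A piano roll can be approximated as a 2D array with pitch on one axis and time steps on the other. The sketch below fills such a matrix for a handful of notes and displays it with matplotlib; the note list and time resolution are arbitrary.

import numpy as np
import matplotlib.pyplot as plt

# (midi_pitch, onset_step, duration_steps) for a few notes, at 4 steps per beat
notes = [(60, 0, 4), (64, 4, 4), (67, 8, 4), (72, 12, 8)]

n_steps = 20
roll = np.zeros((128, n_steps))              # 128 MIDI pitches x time steps
for pitch, onset, duration in notes:
    roll[pitch, onset:onset + duration] = 1.0

plt.imshow(roll[55:80], aspect='auto', origin='lower', cmap='Greys',
           extent=[0, n_steps, 55, 80])
plt.xlabel('Time step')
plt.ylabel('MIDI pitch')
plt.title('Piano roll')
plt.show()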

Note Graph

A Note Graph is a graph-based representation of musical scores, where nodes represent notes and edges capture relationships such as sequence, onset, and sustain. This approach provides a structured way to analyze and model complex musical relationships, such as polyphony, voice leading, and harmonic context.

Key features:

  • Captures relationships between notes beyond simple sequences.
  • Useful for music analysis, generation, and machine learning.
  • Enables modeling of complex structures like counterpoint and harmony.
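A minimal, hedged sketch of the idea using plain Python dictionaries: nodes store note attributes, and edges record "sounding together (onset)" and "next in sequence" relationships. Real note-graph formulations differ in their exact node features and edge types.

# Nodes: note id -> attributes; edges: (from, to, relation)
nodes = {
    0: {'pitch': 60, 'onset': 0.0, 'duration': 1.0},   # C4
    1: {'pitch': 64, 'onset': 0.0, 'duration': 1.0},   # E4, same onset as node 0
    2: {'pitch': 67, 'onset': 1.0, 'duration': 1.0},   # G4, follows in time
}

edges = []
for i in nodes:
    for j in nodes:
        if i < j and nodes[i]['onset'] == nodes[j]['onset']:
            edges.append((i, j, 'onset'))       # simultaneous notes
        if nodes[i]['onset'] + nodes[i]['duration'] == nodes[j]['onset']:
            edges.append((i, j, 'sequence'))    # consecutive notes

print(edges)   # [(0, 1, 'onset'), (0, 2, 'sequence'), (1, 2, 'sequence')]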

Summary Table

Representation | Type        | Typical Use Cases                    | Strengths
MIDI           | Event-based | Sequencing, playback, editing        | Compact, widely supported
ABC Notation   | Text-based  | Folk/traditional music, sharing      | Simple, human-readable
REMI           | Event-based | AI music generation                  | Structured, rhythm-aware
MusicXML       | XML-based   | Sheet music, notation, analysis      | Detailed, expressive, interoperable
Piano Roll     | Visual      | MIDI editing, sequencing             | Intuitive, easy to manipulate
Note Graph     | Graph-based | Analysis, AI, complex relationships  | Captures structure and context
import music21
music21.environment.UserSettings()['musescoreDirectPNGPath'] = '/usr/bin/mscore3'

# Create a simple melody: C D E F G F E D C
melody_notes = ['C4', 'D4', 'E4', 'F4', 'G4', 'F4', 'E4', 'D4', 'C4']
melody = music21.stream.Stream()
for n in melody_notes:
    melody.append(music21.note.Note(n, quarterLength=0.5))

# Show the musical score (this will render in Jupyter if MuseScore or similar is installed)
melody.show()
[Rendered score of the melody]

Questions

  1. What is the difference between a tone and a note in music psychology?
  2. How does a spectrogram differ from a waveform in audio visualization?
  3. What is the purpose of the Mel Spectrogram and why is it perceptually relevant?
  4. Describe the concept of harmony and how technology can be used to analyze it.
  5. What are the main differences between symbolic representations such as MIDI, ABC Notation, and MusicXML?