
6. AI and sound

Speech, music, and sound design with generative audio models

University of Oslo

Why this matters

Sound used to be the laggard medium for generative AI. Images and text arrived in 2022; high-quality audio took another year or two to catch up. By 2024 we had usable text-to-music systems and convincing voice cloning; by 2026 these are everywhere from podcast workflows to film production to Eurovision. This week we look at what these systems can actually do, and where they still fail.

The audio chapter is also where the question of consent becomes most personal: voices are tied to bodies, and cloning a real person's voice can cause real harm.

Three quick families of audio AI

You will meet at least three quite different things under the umbrella of “AI and sound”:

  1. Text-to-speech (TTS) and voice cloning. Synthetic voices reading text. Quality has been transformed by neural vocoders and now by transformer-based systems. Tools: ElevenLabs, Resemble, OpenAI TTS, Microsoft's VALL-E, and (on the recognition side) Norwegian-language efforts like the National Library's NB-Whisper.
  2. Music generation. Systems that produce full pieces of music from text prompts. Tools: Suno, Udio, Stable Audio, and MusicLM in research form (Agostinelli et al., 2023).
  3. Sound effects and design. Generative SFX for film/game pipelines: footsteps on different surfaces, ambient atmospheres, Foley.

Underneath all of them sit the same core ideas as image generation: a model trained on huge amounts of audio learns to predict, and at inference time produces, audio that resembles its training distribution.

Figure: a schematic spectrogram, with time on the x-axis and frequency on the y-axis.

A spectrogram represents sound as an image (time on x, frequency on y). Many audio models treat sound generation as image generation on a spectrogram.

How models represent sound

Computers store sound as a sequence of numbers — typically 44 100 or 48 000 of them per second per channel. That is a lot: a three-minute stereo song is roughly 16 million numbers. Generating sound directly, sample by sample (as the first WaveNet model did in 2016), was beautiful but slow.

Modern audio models almost always work on a compressed representation:

  • A spectrogram — a 2D image of the sound (time × frequency). Treat it as an image, run diffusion over it, then convert back to audio with a vocoder.
  • A discrete code from a neural audio codec (like EnCodec, SoundStream, DAC). The model produces a short sequence of codes; the codec decodes them back to audio.

Both approaches shrink the problem dramatically: a neural codec can represent audio with 50–100× fewer numbers than raw samples. This is why your laptop can now produce a song in seconds.
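
As a rough illustration of the spectrogram route, here is a round trip with the librosa library (the file name and mel parameters are arbitrary choices for this sketch, not taken from any particular model):

# Sketch: audio -> mel spectrogram ("image") -> audio again.
# pip install librosa soundfile; assumes a local file clip.wav.
import librosa
import soundfile as sf

y, sr = librosa.load("clip.wav", sr=22050, mono=True)

# A 2D array: 128 frequency bins x one column per hop of 512 samples.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, hop_length=512)
print("raw samples:", y.shape, "spectrogram:", mel.shape)

# Crude inversion with Griffin-Lim; real systems use a neural vocoder here.
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr, hop_length=512)
sf.write("clip-roundtrip.wav", y_hat, sr)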

The architectural backbone is again usually a transformer (Vaswani et al., 2017), adapted to handle the very long sequences that audio implies. Diffusion models on spectrograms or codec tokens are common (Agostinelli et al., 2023).

Text-to-speech and voice cloning

A modern TTS system takes:

  • A piece of text (or phonemes), and
  • a speaker embedding describing the desired voice, possibly extracted from a short reference recording (5–30 seconds is often enough),

and produces a waveform.
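
In code, a zero-shot cloning call can be as short as the sketch below, here using the open-source Coqui TTS library and its XTTS model (the model name is real; the file names are placeholders, and this is one option among many):

# Sketch: zero-shot voice cloning with Coqui TTS (pip install TTS).
# reference.wav should be a short, consented recording of the speaker.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="Hello, this is a synthetic voice reading your text.",
    speaker_wav="reference.wav",  # the speaker embedding comes from this clip
    language="en",
    file_path="cloned.wav",
)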

What works well in 2026:

  • Convincing prosody in English and most major European languages, including Norwegian (with the right model).
  • Cloning a specific voice from a short reference. Worryingly well.
  • Emotion and style control via prompts (“sad, slow”), tags, or a reference clip.
  • Multilingual speakers — one voice speaking many languages without re-recording.

What still struggles:

  • Long context coherence — a 20-minute audiobook can drift in tone.
  • Sung speech / song-speech mixes in TTS engines (separate music systems do better).
  • Code-switching mid-sentence, especially with code or technical terms.
  • Low-resource languages — Sámi, Faroese, and many African and Asian languages still get poor results.

You can clone a voice in two minutes from a YouTube clip. This is a fact, not a recommendation. Cloning a real person’s voice without their consent is harmful and, increasingly, illegal — for fraud, for harassment, for impersonation. Treat voice cloning as you would treat using somebody’s face: with explicit permission, attribution where appropriate, and a clear use case.

The EU AI Act (European Parliament & Council of the European Union, 2024) places certain forms of deepfake audio in the higher-risk categories, with transparency obligations.

Music generation

Music generation is harder than speech because:

  • The structure is longer-range — verses, choruses, build-ups, drops.
  • The judgement is aesthetic — wrong notes are not “wrong” in the same way as wrong words.
  • The training data is contested — music is heavily copyrighted; using it for training has triggered lawsuits.

Despite all that, by 2026 tools like Suno and Udio reliably produce 2–3 minute songs from a paragraph of prompt. Stems can often be separated for further editing in a DAW.

Useful prompt elements for music models:

  • Genre and era: “1970s funk”, “Norwegian black metal”, “modern indie folk”.
  • Instrumentation: “fingerpicked acoustic guitar, brushed snare, double bass”.
  • Tempo and feel: “BPM 110, swung eighths, intimate, late night”.
  • Lyrics, when supported, in their own field with [Verse]/[Chorus] tags.
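
Put together, a complete prompt built from these elements (an invented example) might read:

modern indie folk, fingerpicked acoustic guitar, brushed snare, double bass, BPM 110, swung eighths, intimate, late night, instrumental, 30 seconds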

What still fails:

  • Specific quotation of existing pieces. Asking for “in the style of Taylor Swift” raises legal and ethical alarms and is often blocked.
  • Coherent lyrics in languages other than English. Norwegian-language music output is improving but uneven.
  • Structural sophistication beyond pop forms — fugues, multi-section classical works, free-jazz dialogues.

Sound design and Foley

Beyond speech and music sits a quieter category: sound design. Models that produce 5–15 seconds of “rain on a tin roof”, “wooden cart on cobblestones”, or “alien ambience” are reshaping film, game, and podcast workflows. Tools include ElevenLabs sound effects, Stable Audio, AudioGen, and various open-source efforts.

This category is the unsung workhorse: less spectacular than song generation, less morally fraught than voice cloning, often the most immediately useful in production.

Where sound AI fits in a real workflow

Three observations:

  1. AI is rarely the whole pipeline. Generated audio gets imported into a DAW (Reaper, Ableton, Logic, Reason). It is layered, EQ-ed, mixed, and sometimes re-recorded.
  2. Stem separation has improved dramatically. Tools like Demucs and commercial offerings let you split a song into vocals, drums, bass, and other (see the one-liner after this list). This makes AI generation useful even when you cannot get clean stems out of it.
  3. Speech-to-text is the silent revolution. OpenAI’s Whisper and its Norwegian variant made transcription nearly free. Researchers, journalists, and podcasters use it daily. We will use it in the practice session.
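
Stem separation really is a one-liner with the open-source Demucs CLI (the file name is a placeholder; output lands in a separated/ folder):

pip install demucs
demucs --two-stems=vocals my-song.mp3   # vocals + everything else
demucs my-song.mp3                      # vocals, drums, bass, other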

This week’s lab: Reflect, Explore, Create

Reflect (≈ 30 min, in lab + your weekly log)

Pick one prompt and write 150–300 words in your weekly log:

  1. Voice cloning is now essentially a free service. What changes for journalism? For political ads? For your own digital footprint?
  2. Listen carefully to a generated 30-second clip. What gives it away? What does not give it away? Are you sure?
  3. Music generation models were trained on existing music. Imagine you are a working musician — how do you feel about that? Imagine you are a film composer working on a small project — how does that change your answer?

Explore (≈ 45 min, in lab) — transcribe and remix

  1. Record (or pick) a 30–60 second voice clip that you have permission to use.
  2. Transcribe it with Whisper (via the web, via a local install, or via a service).
  3. Generate a new voice reading the same text in a TTS tool.
  4. Compare the two side by side. What did the AI catch? What did it miss? What does the synthetic voice add or remove? Document the comparison in your log.

Optional code track. Install openai-whisper locally (the Norwegian-tuned NB-Whisper models are available through Hugging Face; see below):

pip install openai-whisper
whisper my-clip.mp3 --model small --language Norwegian
# the small model is fast; medium or large are slower but more accurate
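
If you want the Norwegian-tuned NB-Whisper models instead, one route is the Hugging Face transformers pipeline (a sketch; the model size is a choice, and decoding mp3 requires ffmpeg on the system):

# Sketch: NB-Whisper via transformers (pip install transformers torch).
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="NbAiLab/nb-whisper-small",  # smaller sibling of nb-whisper-large
    chunk_length_s=30,                 # split long clips into 30 s chunks
)
print(asr("my-clip.mp3")["text"])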

Create (≈ 60 min, in lab + carry-over to your portfolio) — a 30-second piece

  1. Pick a small brief — for example: “background music for a UiO research lab promo video, 30 seconds”.
  2. Generate a song in a music tool. Iterate prompts until you have something usable.
  3. Generate a separate ambient sound effect bed.
  4. Mix the two (any DAW works; even Audacity is enough, or script the mix as sketched after this list).
  5. Document: tools, prompts, edits, time taken. Add a consent and provenance note — for any voice or sample used, where did it come from and do you have the right to publish it?
  6. Export and commit the final mix to your portfolio.
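
If you would rather script the mix than open a DAW, pydub is enough for a rough balance (file names are placeholders; pydub needs ffmpeg installed):

# Sketch: tuck the ambience bed 12 dB under the music and overlay them.
from pydub import AudioSegment

music = AudioSegment.from_file("music.wav")
bed = AudioSegment.from_file("ambience.wav") - 12      # lower the bed by 12 dB
mix = music.overlay(bed.fade_in(1000).fade_out(2000))  # fade times in ms
mix.export("final-mix.wav", format="wav")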

For audio generation in code, the audiocraft library by Meta lets you run MusicGen locally on a moderate GPU.
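
A minimal sketch, following audiocraft's documented API (the prompt and output name are examples):

# Sketch: 30 seconds of music from a text prompt with MusicGen.
# pip install audiocraft; expects a GPU with a few GB of memory.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=30)  # seconds

wav = model.generate(["modern indie folk, fingerpicked guitar, BPM 110"])
audio_write("lab-promo", wav[0].cpu(), model.sample_rate, strategy="loudness")

The same library's AudioGen models generate sound effects through the same interface, which covers the ambient bed for the Create exercise as well.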

Going further

  • Engel et al. (2017), Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders — an early but influential paper from Google's Magenta team.
  • Agostinelli et al. (2023), MusicLM: Generating Music From Text — the research paper from Google.
  • The Spawning Coalition (Spawning, 2024) — Holly Herndon's and others' movement around opting voices and likenesses out of training datasets.
  • The NB-Whisper project (Norwegian National Library, 2024) at the NB AI Lab — a great example of a low-resource-language AI effort at the National Library of Norway.
References
  1. Agostinelli, A., Denk, T. I., Borsos, Z., Engel, J., Verzetti, M., Caillon, A., Huang, Q., Jansen, A., Roberts, A., Tagliasacchi, M., Sharifi, M., Zeghidour, N., & Frank, C. (2023). MusicLM: Generating Music From Text. https://arxiv.org/abs/2301.11325
  2. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/1706.03762
  3. European Parliament & Council of the European Union. (2024). Regulation (EU) 2024/1689 — The AI Act. https://eur-lex.europa.eu/eli/reg/2024/1689/oj
  4. Engel, J., Resnick, C., Roberts, A., Dieleman, S., Eck, D., Simonyan, K., & Norouzi, M. (2017). Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. https://arxiv.org/abs/1704.01279
  5. Spawning. (2024). Spawning — Opt-out and Consent Tools for AI Training Data. Spawning Inc. https://spawning.ai/
  6. Norwegian National Library. (2024). NB-Whisper: Norwegian Speech Recognition Models. Nasjonalbiblioteket — NB AI Lab. https://huggingface.co/NbAiLab/nb-whisper-large
  7. Norwegian National Library. (2024). NB AI Lab — AI Research at the National Library of Norway. Nasjonalbiblioteket. https://ai.nb.no/