Why this matters¶
For most of this book we have looked at AI as a system that takes a prompt and produces an artefact. By 2026 a quieter revolution has been changing what AI is:
- Multimodal models see images and hear sound as easily as they read text.
- Agentic systems chain models together, take actions in the world, and complete multi-step tasks autonomously.
These two ideas, taken together, mean that the creative AI of the late 2020s looks less like a slot machine and more like a collaborator that can read your screen, look at your sketch, listen to your voice memo, write code, and pay for an API call.
This week we look at the architecture and the ethics of that shift.
Multimodal models¶
Until around 2023, generative AI tools were medium-specific: ChatGPT did text, Midjourney did images, MusicLM did music. Modern models are increasingly multimodal: a single architecture handles many media in and many media out.
*Figure: a schematic of a multimodal model, with many modalities in, many modalities out, and a single shared “thinking” space in the middle.*
The technical recipe, sketched in code after this list, is roughly:
- Encode each modality (text, image, audio) into a sequence of vectors using a small modality-specific encoder.
- Pass them through a shared transformer Vaswani et al., 2017 that does not care which modality they came from.
- Decode to whichever modality is wanted on the way out.
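As a concrete sketch of that recipe, here is a toy version in PyTorch. All the sizes, the patch and frame encoders, and the text-only decode head are assumptions chosen for illustration; real multimodal models are vastly larger, but the shape is the same: small modality-specific encoders feeding one shared transformer.

```python
# Toy "encoders -> shared transformer -> decoder" multimodal model.
# Every dimension here is an illustrative assumption, not a real model's config.
import torch
import torch.nn as nn

D = 256  # width of the shared "thinking" space

class TinyMultimodal(nn.Module):
    def __init__(self, vocab=1000):
        super().__init__()
        # Modality-specific encoders: each maps raw input to sequences of D-dim vectors.
        self.text_embed = nn.Embedding(vocab, D)      # token ids -> vectors
        self.image_patch = nn.Linear(16 * 16 * 3, D)  # flattened 16x16 RGB patches -> vectors
        self.audio_frame = nn.Linear(80, D)           # 80-bin mel frames -> vectors
        # Shared transformer: does not know (or care) which modality a vector came from.
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.shared = nn.TransformerEncoder(layer, num_layers=2)
        # Decode head: here, back to text logits; other heads would decode other modalities.
        self.to_text = nn.Linear(D, vocab)

    def forward(self, text_ids, image_patches, audio_frames):
        seq = torch.cat([
            self.text_embed(text_ids),
            self.image_patch(image_patches),
            self.audio_frame(audio_frames),
        ], dim=1)                     # one sequence, three modalities
        thought = self.shared(seq)    # shared "thinking" space
        return self.to_text(thought)  # decode to the wanted modality

model = TinyMultimodal()
out = model(
    torch.randint(0, 1000, (1, 8)),  # 8 text tokens
    torch.randn(1, 4, 16 * 16 * 3),  # 4 image patches
    torch.randn(1, 6, 80),           # 6 audio frames
)
print(out.shape)  # torch.Size([1, 18, 1000])
```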
Concrete consequences for creative work:
- You can paste an image and ask a question about it. Useful for design critique, architecture review, art history.
- You can hand the model a screenshot and ask it to fix the UI.
- You can play it a 30-second clip and ask for the genre, tempo, and emotional tone.
- You can talk to it like a phone call — Voice Mode in major chat products.
- You can give it a sketch and ask for a polished version, or a polished image and ask for a structural sketch.
This unlocks a different kind of prompt: show, don’t tell. The most useful prompt is often “here is what I am working with — here is what I want — please help.”
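In practice, showing the model something is a single API call. The sketch below uses the OpenAI Python SDK as one example; the model name, the file path, and the exact payload shape differ across providers and versions, so treat it as a template rather than a recipe.

```python
# "Show, don't tell": send an image plus a question to a multimodal chat model.
# "sketch.png" and the model name are placeholder assumptions for this example.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("sketch.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable chat model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Here is what I am working with. Critique the layout "
                     "and suggest one structural change."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```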
Agentic AI¶
The shift from multimodal models to agents tracks a deeper conceptual shift that Salma, Hijón-Neira, and Pizarro flag in Salma et al., 2025: from AI as a passive executor of commands to AI as an active collaborator in the process. An agent does not just respond to a prompt; it plans, takes steps, observes results, and adjusts — much closer to how a human collaborator behaves on a brief.
An AI agent is a system that is given a goal and decides for itself which steps to take. The minimum architecture, sketched in code after this list, is:
- An LLM with tools it can call — web search, code execution, file system, browser, APIs.
- A loop: the model takes a step, observes the result, plans the next step, repeats.
- A stopping criterion: the goal is met, the user intervenes, or a budget is exhausted.
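Here is that minimum architecture as a sketch. The llm_plan() function is a hypothetical stand-in for a real model call and the tools are toys; the plan-act-observe loop and the stopping criteria are the point.

```python
# Minimal agent loop: tools, a plan-act-observe loop, stopping criteria.

def web_search(query: str) -> str:
    """Toy tool: a real agent would call a search API here."""
    return f"results for: {query}"

def run_python(code: str) -> str:
    """Toy tool: a real agent would run code in a sandbox here."""
    return "42"

TOOLS = {"web_search": web_search, "run_python": run_python}

def llm_plan(goal: str, history: list) -> dict:
    """Hypothetical stand-in: ask the LLM for the next step, given goal and history."""
    if len(history) >= 2:  # pretend the model decides it is done
        return {"action": "finish", "answer": "done: " + goal}
    return {"action": "web_search", "input": goal}

def run_agent(goal: str, max_steps: int = 10) -> str:
    history = []
    for _ in range(max_steps):          # stopping criterion: budget exhausted
        step = llm_plan(goal, history)  # plan
        if step["action"] == "finish":  # stopping criterion: goal met
            return step["answer"]
        observation = TOOLS[step["action"]](step["input"])  # take a step with a tool
        history.append((step, observation))  # observe the result, then plan again
    return "stopped: step budget exhausted"  # (user intervention would be Ctrl-C)

print(run_agent("find the tempo and genre of this track"))
```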
Agents have existed in research since the 1970s. The current generation works because the underlying LLMs are good enough to plan and to recover from errors. By 2026, agentic tools include Anthropic’s Claude Code, OpenAI’s Operator, Cursor’s agent mode, and many specialised ones for sales, support, research, and creative pipelines.
For creative work, the most useful agents are:
- Coding agents — implement a small feature across multiple files (chapter 8).
- Research agents — plan a literature search, summarise the findings, draft a memo.
- Production agents — wire together image, audio, and video tools into a single pipeline.
The hardest part is supervision: agents can do enormous amounts of work, much of it wrong. The discipline is to give an agent narrow, cheap-to-verify tasks.
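“Cheap to verify” has a concrete shape: wrap each delegated step in a machine check that costs far less than the work itself, and bounce failures back to the agent. Both functions in this sketch are hypothetical placeholders; the gate pattern is what matters.

```python
# Verification gate around a delegated agent step (all names are hypothetical).
import json

def agent_produce_metadata(brief: str) -> str:
    """Hypothetical agent step: return track metadata as a JSON string."""
    return '{"genre": "ambient", "tempo_bpm": 92}'

def verify(output: str) -> bool:
    """Cheap check: valid JSON, required fields present, tempo plausible."""
    try:
        data = json.loads(output)
        return {"genre", "tempo_bpm"} <= data.keys() and 40 <= data["tempo_bpm"] <= 220
    except (json.JSONDecodeError, TypeError, AttributeError):
        return False

output = agent_produce_metadata("a 60-second ambient cue")
print(output if verify(output) else "rejected: send back to the agent")
```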
A creative pipeline as an agent¶
Imagine you want to produce a 60-second animated short film. The agent’s plan, sketched in code after this list, might look like:
- Story generation. Ask an LLM for three story outlines based on the brief. Pick one.
- Storyboard. Generate 12 storyboard panels with an image model (chapter 5).
- Animation. For each storyboard panel, run image-to-video with motion prompts (chapter 7).
- Voice-over. Generate the voice-over with TTS (chapter 6).
- Music. Generate the score with a music model (chapter 6).
- SFX. Generate Foley and ambience.
- Assembly. Open a video editor, place the clips, sync the voice-over, mix the audio.
- Render. Export the final film.
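As a sketch in code: every one-line stub below is a hypothetical placeholder for a model call (chapters 5 to 7); the sequencing, the fan-out over panels, and the human checkpoint before the expensive video step are the point.

```python
# Hypothetical one-line stubs; in a real pipeline each wraps a model API.
def generate_outline(brief): return f"outline for: {brief}"
def generate_panel(story, i): return f"panel {i} of {story}"
def motion_prompt(story, i): return f"slow pan over panel {i}"
def image_to_video(panel, motion): return f"clip({panel}, {motion})"
def narration(story): return f"voice-over script for {story}"
def text_to_speech(text): return f"audio({text})"
def generate_music(story): return "score.wav"
def generate_foley(story): return "sfx.wav"
def assemble(clips, voice, score, sfx): return {"clips": clips, "audio": [voice, score, sfx]}
def render(timeline): return "film.mp4"

def human_approves(label, artefact):
    # Human-in-the-loop checkpoint before money is spent on video.
    return input(f"Approve {label}? [y/n] ").strip().lower() == "y"

def make_short(brief):
    outlines = [generate_outline(brief) for _ in range(3)]  # 1. story generation
    story = outlines[0]                                     #    pick one
    panels = [generate_panel(story, i) for i in range(12)]  # 2. storyboard
    if not human_approves("storyboard", panels):
        raise SystemExit("revise the storyboard before animating")
    clips = [image_to_video(p, motion_prompt(story, i))     # 3. animation
             for i, p in enumerate(panels)]
    voice = text_to_speech(narration(story))                # 4. voice-over
    score = generate_music(story)                           # 5. music
    sfx = generate_foley(story)                             # 6. SFX
    timeline = assemble(clips, voice, score, sfx)           # 7. assembly
    return render(timeline)                                 # 8. render

print(make_short("a 60-second short about a lighthouse keeper"))
```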
Each step is doable by a separate, focused tool. Stitching them together used to be a producer’s job; in 2026 that producer is increasingly an agent.
This is not the same as “AI replaces a film crew”. It is more like “an apprentice now does the wiring while the director directs.”
What agents are not¶
- They are not magical. Most fail at multi-step tasks more than 50% of the time without careful supervision.
- They are not cheap at scale. A single agent run on a non-trivial task can cost USD/EUR 1–10 in API fees. Multiplied by users, this adds up.
- They are not safe by default. An agent with a browser can pay for things, send messages, sign up to services. Read the permissions before you press go.
- They do not have opinions. They have outputs. Pretending otherwise is a category error.
Where this is going¶
By the late 2020s, the centre of gravity of generative AI is moving from “produce one artefact from one prompt” to “complete one task using many tools over time.” That is a much bigger change than it first appears, because it changes who is doing the work (still you, but as a director) and where the value sits (in the brief, the supervision, and the taste).
For students this means a single piece of practical advice: become very good at writing briefs. A brief is a precise, ambitious, well-bounded request. Models cannot read your mind; agents cannot guess what you really want. Whoever can write a great brief, in 2026, can ship work that used to require a team.
This week’s lab: Reflect, Explore, Create¶
Reflect (≈ 30 min, in lab + your weekly log)¶
Pick one prompt and write 150–300 words in your weekly log:
- In your discipline, what is a brief? Who writes it now, and who would write it if AI did the rest of the work?
- Imagine an agent that watches all your work and proactively suggests “next steps” in the background. Is that a tool or a colleague? Should you pay it?
- The most successful agents in 2026 are also the ones with the most permissions (browsers, payment, code execution). What is the price of that?
Explore (≈ 30 min, in lab) — a multimodal conversation¶
- Take a photograph of something messy or interesting in your daily life — your desk, a Tupperware drawer, a chord diagram from a music book.
- Upload it to a multimodal chat model. Ask it three increasingly specific questions about the image.
- Pivot: ask it to redesign what is in the image (a tidied desk, a different kind of chord). Note where its design knowledge ends and its general pattern-matching begins.
Create (≈ 60 min, in lab + carry-over to your portfolio) — design an agent¶
You will not necessarily implement an agent today. Instead, design one thoroughly enough that you could.
- Pick a small creative goal you care about (a website portfolio, a podcast episode, a comic page, a recipe book, a teaching resource).
- List every single step an agent would take to complete it. Draw the flow as a diagram (whiteboard, paper, Excalidraw, draw.io).
- For each step, write down: which model, what input, what output, how you would verify the output, and what would happen if it failed. Add an explicit human-in-the-loop checkpoint where one is needed (a sample step spec follows this list).
- Stop. Look at the diagram. Which steps are you happy to delegate, and which would you keep manual? Annotate the diagram with these answers.
- Commit the diagram + a short README to your portfolio.
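For the per-step notes, a plain Python dict (or the YAML equivalent) is enough structure. Every value in this example is hypothetical, written for the storyboard step of the film pipeline above.

```python
# One step of an agent design, captured as data (all values are example assumptions).
step = {
    "name": "storyboard",
    "model": "an image model (chapter 5)",
    "input": "chosen story outline + style reference image",
    "output": "12 storyboard panels (PNG)",
    "verify": "all 12 panels exist; characters consistent across panels",
    "on_failure": "regenerate the failing panel up to 3 times, then escalate",
    "human_in_the_loop": True,  # director approves before animation spends money
}
```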
Optional code track. Use the OpenAI Agents SDK or LangGraph to build a tiny three-step agent that implements one slice of your design — e.g., search for a paper, summarise it, write a one-sentence headline. Don’t expect magic — expect a useful sketch.
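One possible starting point for that optional track, using the OpenAI Agents SDK (pip install openai-agents): the search_papers tool is a stub you would replace with a real literature-search API, and the SDK surface shown here may have moved, so check the current documentation before relying on it.

```python
# Tiny three-step research agent: search, summarise, headline.
from agents import Agent, Runner, function_tool

@function_tool
def search_papers(query: str) -> str:
    """Hypothetical stub: a real tool would call a literature-search API."""
    return "Vaswani et al. (2017), 'Attention Is All You Need', NeurIPS."

agent = Agent(
    name="research-sketch",
    instructions=(
        "Search for one paper on the user's topic, summarise it in two "
        "sentences, then write a one-sentence headline."
    ),
    tools=[search_papers],
)

result = Runner.run_sync(agent, "transformer architectures")
print(result.final_output)
```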
Going further¶
- Anthropic, Claude Code documentation Anthropic, 2024 — readable, opinionated take on coding agents.
- OpenAI, Introducing Operator OpenAI, 2025 — a browser-using agent product page.
- Andrej Karpathy, Software 2.0 / 3.0 talks Karpathy, 2024 — programming-with-prompts as a paradigm shift.
- The AutoGPT Significant Gravitas, 2023 and BabyAGI Nakajima, 2023 historical projects — for a sense of where this all started in 2023.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/1706.03762
- Salma, Z., Hijón-Neira, R., & Pizarro, C. (2025). Designing Co-Creative Systems: Five Paradoxes in Human-AI Collaboration. Information, 16(10), 909. https://doi.org/10.3390/info16100909
- Anthropic. (2024). Claude Code Documentation. Anthropic. https://docs.anthropic.com/claude-code
- OpenAI. (2025). Introducing Operator. OpenAI. https://openai.com/index/introducing-operator/
- Karpathy, A. (2024). Software 2.0 and 3.0. https://karpathy.ai/zero-to-hero.html
- Significant Gravitas. (2023). AutoGPT: An Autonomous GPT-4 Experiment. GitHub. https://github.com/Significant-Gravitas/AutoGPT
- Nakajima, Y. (2023). BabyAGI: An Autonomous Task-Driven AI Agent. GitHub. https://github.com/yoheinakajima/babyagi