## Why this matters
For most of its history, machine learning was used to classify things: is this email spam, is this image a cat, does this MRI scan show a tumour? Generative AI flips the question around: instead of labelling an existing thing, the model produces a new thing.
This change is small in mathematics and enormous in practice. It is also the change that turned machine learning from a back-office technology into something people use in the foreground, every day. This chapter introduces the vocabulary you will need for the rest of the book.
## From classifying to generating
Suppose we have a dataset of pictures of cats. A classifier learns a function:

> “given an image, output 1 if it is a cat, 0 otherwise.”

A generator learns a different function:

> “produce an image that looks like the cats you have seen.”
The classifier learns about a boundary (cat / not-cat). The generator learns about a distribution (the space of plausible cat pictures). The distribution is much harder to learn — but once you have it, you can sample from it and get pictures that did not exist before.
This is the single mental shift behind everything in this book.
## Probability without tears
You do not need formal probability for this course, but two ideas help.
### Distributions
A distribution is a way of saying how likely each possible thing is. For example, the heights of UiO students form a distribution: 1.70 m is more likely than 2.20 m. A generative image model implicitly learns a distribution over images. A generative language model learns a distribution over sequences of words.
The catch: the space of all possible images is unimaginably large. A single 512 × 512 colour image contains 512 × 512 × 3 = 786 432 numbers. The set of plausible images is a vanishingly thin slice of that vast space. Learning to navigate that slice is what diffusion models, GANs, and transformers do.
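If you want the idea of a distribution to feel concrete, here is a minimal sketch that treats heights as a normal distribution and checks how often samples land near each value. The mean and spread are invented for illustration, not real UiO data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Invented parameters for illustration: mean 1.75 m, standard deviation 0.09 m.
heights = rng.normal(loc=1.75, scale=0.09, size=100_000)

def share_near(x, eps=0.01):
    """Fraction of sampled heights within +/- eps metres of x."""
    return np.mean(np.abs(heights - x) < eps)

print(share_near(1.70))  # common: several percent of samples
print(share_near(2.20))  # essentially zero
```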
### Sampling
To sample is to draw a single concrete thing from a distribution. Every time you press “generate” in an image tool, the model is sampling from the distribution it learned. Press the button twice and you get two different samples — usually similar in style, never identical.
Samplers have parameters:
- Temperature (text and audio models) — high temperature flattens the distribution, making the output more varied and weird. Low temperature sharpens it, making the output safer and more predictable.
- Top-k / top-p (text models) — restrict sampling to the k most likely next tokens, or to the smallest set of tokens whose cumulative probability reaches p. This stops the model from sampling extremely rare nonsense.
- CFG / guidance scale (image and video models) — strengthens the influence of the conditioning prompt. Too high and the result becomes oversaturated and rigid; too low and it ignores the prompt.
- Steps (diffusion models) — how many denoising steps to take. More steps generally means sharper output, up to a point.
The same prompt with different sampler settings can produce dramatically different results. This is one of the most useful things to internalise this semester.
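To see what the temperature knob does at the level of raw numbers, here is a minimal numpy sketch (the logits are invented) that converts model scores into next-token probabilities at three temperatures:

```python
import numpy as np

def probs_at_temperature(logits, t):
    """Softmax with temperature: divide logits by t before normalising."""
    scaled = logits / t
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    p = np.exp(scaled)
    return p / p.sum()

# Invented scores for four next-token candidates.
logits = np.array([3.0, 2.0, 1.0, 0.1])

for t in [0.3, 1.0, 2.0]:
    print(f"t={t}: {np.round(probs_at_temperature(logits, t), 3)}")
# Low t concentrates probability on the top candidate (safer, more predictable);
# high t spreads it across all candidates (more varied, weirder).
```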
## Conditioning: telling the model what you want
If a generator just samples from its full distribution, you get something randomly cat-shaped. Useful, but not creative work. The technical word for steering the generator is conditioning.
Conditioning is anything that goes into the model in addition to noise to bias what it produces.
- A text prompt is the most common form of conditioning.
- A reference image can also condition the generation (“make it look like this”).
- A mask can condition where to change (“only edit this part of the picture”).
- A control signal like a pose skeleton, edge map, or depth map can condition shape and composition (this is what tools like ControlNet do, see chapter 5).
- An audio waveform or a piece of MIDI can condition a music model.
- A previous frame can condition the next frame in a video model.
Once you start looking, you will see conditioning everywhere. The user interface of every generative tool is essentially a conditioning console.
*Figure: Multiple ways to condition a single generative model. The model itself is shared; the inputs change.*
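If you would rather see the conditioning console as code than as sliders, here is a minimal sketch using the diffusers library. The checkpoint name is one example among many, and a GPU is assumed; the point is that the prompt, the guidance scale, and the step count from the sampler list above all appear as plain function arguments.

```python
# pip install diffusers transformers accelerate
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # example checkpoint; swap in any text-to-image model
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt="a watercolour illustration of a fox reading a book in a library",
    guidance_scale=7.5,       # CFG: how strongly the text prompt conditions the image
    num_inference_steps=30,   # number of diffusion denoising steps
).images[0]
image.save("fox.png")
```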
## Prompts as the new interface
For a generation of users who have never seen a command line, prompts are the new interface. A prompt is just text, but writing a good one is now its own small craft.
A few principles that hold across most chat and image tools:
- Be specific about subject, style, context, and constraints.
- Show, don’t only tell — quote a sentence, paste an example, attach an image.
- Iterate — generation is cheap. Treat the first output as a draft.
- Constrain the form, not only the content — “in three sentences”, “as a bulleted list”, “as a poster in 1:2 ratio”.
- Inspect failures — when the output is wrong, write down how it is wrong. Naming the failure precisely usually tells you more about what to change than rereading your initial prompt.
We dedicate chapter 4 to prompts for language models. For now, the take-away is that the prompt is the interface, and that interface is text. (Even when it includes images, you usually still describe what to do in words.)
## A taxonomy of generative models
You will hear many architecture names in the wild. The four big families are:
- GANs (Generative Adversarial Networks) (Goodfellow et al., 2014) — train two networks against each other: a generator tries to produce realistic samples, a discriminator tries to tell them apart from real data. Dominant for images 2014–2020, now mostly retired.
- Autoregressive models — predict the next token given the previous ones. Powers all chat models (next word) and many audio and image models. Underlying maths: factorise the joint distribution as a product of conditionals (written out after this list).
- Diffusion models (Ho et al., 2020; Rombach et al., 2022) — learn to denoise. Starting from random noise, the model iteratively removes a bit of noise to reveal a coherent image, audio, or video. Dominant for high-quality images, video, and (increasingly) audio.
- Flow matching / rectified flows (Esser et al., 2024) — a close cousin of diffusion models with simpler training. Powers some of the latest image and video models.
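The factorisation behind the autoregressive family is worth seeing written out once: the probability of a whole sequence is a product of one next-token prediction per position,

$$
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1}).
$$

Each factor is exactly the “predict the next token” task, which is why the same recipe works for words, audio samples, and image tokens alike.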
You do not need to memorise this. But when a tool brags about being “GAN-based” or “diffusion-based” or “flow-based”, you should be able to nod and understand roughly what that implies.
## Five paradoxes of working with generative AI
We have just covered how the system works — distributions, sampling, conditioning, prompts, architectures. Before we walk into the medium-specific chapters that follow, it is worth pausing on how to work with the system.
The most useful conceptual handle here comes from a 2025 paper by Salma, Hijón-Neira, and Pizarro (Salma et al., 2025), which argues that today’s generative tools (Midjourney, Copilot, ChatGPT, and their cousins) almost all operate as executors — they take a command and produce an output. Creative work, by contrast, is non-linear, iterative, and ambiguous. That mismatch produces five irreducible paradoxes in human–AI co-creation:
| Paradox | Core tension | Practical question for you |
|---|---|---|
| Ambiguity vs. precision | Your creative intent is vague; the model needs precise input. | How do you translate a vision into a prompt without prematurely closing the exploration? |
| Control vs. serendipity | You want to steer; the most interesting outputs are the unexpected ones. | How do you stay open to “happy accidents” while keeping authorship? |
| Speed vs. reflection | The model generates in seconds; understanding takes minutes. | Where will you build in pauses, friction, and re-reading? |
| Individual vs. collective | Your voice is unique; the model is trained on the average of millions of voices. | How do you keep your signature when your collaborator is “the wisdom of the crowd”? |
| Originality vs. remix | Generative AI is an extreme remix engine; you also want work that is yours. | Where does the novelty come from — the model, the prompt, the edits, the curation? |
The Salma paper’s key move is to argue that these are not bugs to fix; they are tensions to manage. A good Creative AI practice does not try to resolve them. It learns to live inside them, deliberately leaning to one side when the project needs it, and to the other side when it needs the opposite.
A worked walk-through:
- Ambiguity vs. precision. A useful early move is to over-describe in plain language (“I want this to feel like an early-morning Oslo trikk window seen from a damp coat”) and then ask the model to translate your description into a prompt it can actually work with. You keep the ambiguity in your head; the model converts it to precision on the canvas. The interface you want is a multi-turn refinement loop, not a single big input box.
- Control vs. serendipity. Always generate at least three variations of anything you care about. The one you would not have chosen on paper is often the one that does the most work for you. Veto power on suggestions is your most important creative skill in this mode — and “veto” includes the active “the AI is wrong here, but the wrongness gives me an idea” move.
- Speed vs. reflection. The model can generate a song in 90 seconds; spend at least 10 minutes listening to it before you ask for the next one. Build pause points into your workflow. A central risk of generative AI in education and in professional practice is attentional deskilling — losing the habit of looking long.
- Individual vs. collective. Watch for the “average” pull. If every image of “a Norwegian fjord” the model gives you looks like the same tourism poster, you are getting the collective; your work needs an explicit push the other way (a reference image, a style word, a constraint, an act of refusal).
- Originality vs. remix. Accept, honestly, that the model is remixing prior work. Your originality lives in the brief, the prompt, the selection, the edits, and the context you put around the output. This is the central argument of chapter 11: in a culture saturated with remix, the centre of gravity of authorship shifts from making to directing and curating.
These five paradoxes will quietly structure the rest of the book. Every applied chapter (image, sound, video, code, 3D, agents) re-runs the same five tensions in a different medium. By the end of the course you should be able to name which paradox you are wrestling with in any given lab session — and a good Surprise / Will process memo (see the introduction) will usually be a story about exactly one of them.
## What generative models cannot do
A short list, useful to keep in mind:
- They do not know what is true. A text model can produce a perfectly written paragraph that is factually wrong. An image model can produce a person with six fingers. We will return to this throughout the course.
- They generalise from training data, not from causal understanding. They have no model of physics, no model of social cause, no model of why things happen.
- They are not deterministic at default settings. Same prompt, different output (a seed-fixing sketch follows this list).
- They cannot remember you across sessions unless they are explicitly designed to (with retrieval, memory, or fine-tuning).
- They are bounded by their training cutoff. Anything newer than the cutoff is invisible to them unless they can search the web or read uploaded files.
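The non-determinism point is easy to verify, and to switch off. Here is a minimal sketch with transformers (same caveat as the lab code below: GPT-2 is a toy). Fixing the random seed makes sampling reproducible, at least on the same machine and library versions.

```python
from transformers import pipeline, set_seed

gen = pipeline("text-generation", model="gpt2")

set_seed(42)
a = gen("The fjord at dawn", max_new_tokens=20, do_sample=True)[0]["generated_text"]

set_seed(42)  # reset the seed: the sampler retraces exactly the same random choices
b = gen("The fjord at dawn", max_new_tokens=20, do_sample=True)[0]["generated_text"]

print(a == b)  # True
```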
## This week’s lab: Reflect, Explore, Create
### Reflect (≈ 30 min, in lab + your weekly log)
Pick one of the prompts below and write 150–300 words in your weekly log:
- Describe in plain language the difference between learning a distribution and learning a boundary. Why does that matter for creative work?
- Tools like Midjourney expose only a few sampler parameters; tools like ComfyUI expose dozens. Whom does each design serve?
- The same model, same prompt, two clicks: two different images. Is this a feature or a bug for your use case?
- Re-read the Five paradoxes above and pick the one that bites hardest for your own discipline. Where does it bite, and what does that tell you about the kind of co-creative setup you would want?
### Explore (≈ 60 min, in lab)
**Same prompt, three samplers.**
- Pick one image tool you have access to.
- Write a prompt that has a clear style and subject — for example, “a watercolour illustration of a fox reading a book in a library, soft lighting, warm tones”.
- Generate the image with three different settings: default, low CFG / guidance (or low temperature), high CFG / guidance (or high temperature).
- Paste the three results side by side and describe in two sentences each what changed.
**Conditioning beyond text.**
- Generate an image you like.
- Use the same tool’s image-to-image mode (or style reference) to generate a variation with the same overall composition.
- Write down how strict the conditioning was — what is preserved, what changes?
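For those on the code track, the same experiment can be run in diffusers with an image-to-image pipeline (the checkpoint name and filename below are placeholders). The strength parameter is the knob that decides how strict the conditioning is.

```python
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # example checkpoint
    torch_dtype=torch.float16,
).to("cuda")

init = load_image("my_generated_image.png")  # placeholder: the image you liked in step 1

out = pipe(
    prompt="the same scene in autumn colours",
    image=init,
    strength=0.4,  # low strength preserves composition; high strength departs from it
).images[0]
out.save("variation.png")
```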
**Optional code track — sampling from a tiny language model.**
```python
from transformers import pipeline

# A deliberately small model: fast to download and cheap to run on a laptop.
gen = pipeline("text-generation", model="gpt2")

# Sample the same prompt at three temperatures and compare the continuations.
for t in [0.2, 0.7, 1.2]:
    out = gen("Once upon a time in Oslo,", max_new_tokens=40, temperature=t, do_sample=True)
    print(f"--- t={t} ---")
    print(out[0]["generated_text"])
```

GPT-2 is small and quaint by 2026 standards, but it illustrates the temperature knob nicely.
### Create (≈ 30 min, in lab + carry-over to your portfolio)
From the Explore experiments, compose a triptych for your portfolio: three panels arranged side by side, each captioned with the one parameter (sampler, CFG, temperature, conditioning image) that produced it. Add a one-paragraph artist’s statement (~150 words) framing the triptych as a small piece of work, not just a screenshot grid. The goal is to push from “I ran the experiments” to “I made this, and here is why”.
## Going further
- Lilian Weng, *What are diffusion models?* (Weng, 2021) — a careful, illustrated technical introduction.
- Andrej Karpathy, *The unreasonable effectiveness of recurrent neural networks* (Karpathy, 2015) — old, but a beautifully written introduction to next-token generation.
- Hugging Face, *Diffusers documentation* (Hugging Face, 2024).
- Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative Adversarial Nets. Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/1406.2661
- Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2006.11239
- Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://arxiv.org/abs/2112.10752
- Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Lacey, K., Goodwin, A., Marek, Y., & Rombach, R. (2024). Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. https://arxiv.org/abs/2403.03206
- Salma, Z., Hijón-Neira, R., & Pizarro, C. (2025). Designing Co-Creative Systems: Five Paradoxes in Human-AI Collaboration. Information, 16(10), 909. https://doi.org/10.3390/info16100909
- Weng, L. (2021). What Are Diffusion Models? https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
- Karpathy, A. (2015). The Unreasonable Effectiveness of Recurrent Neural Networks. https://karpathy.github.io/2015/05/21/rnn-effectiveness/
- Hugging Face. (2024). Diffusers — State-of-the-art Diffusion Models for Image, Video, and Audio Generation. Hugging Face. https://huggingface.co/docs/diffusers/index