Why this matters¶
Image is the medium where Creative AI announced itself loudest. In 2022, when Stable Diffusion, DALL·E 2, and Midjourney arrived within a few months of each other, they changed picture-making faster than any tool since the smartphone camera. Designers, illustrators, journalists, lawyers, and the rest of us are still working out what that means.
This chapter introduces the technology underneath text-to-image models, the practical vocabulary of using them, and the editorial questions they raise.
How diffusion works (in pictures)¶
Modern image models are diffusion models (Ho et al., 2020; Rombach et al., 2022). The idea is simpler than it looks.
Figure: forward diffusion (top) starts from a real image and progressively adds noise; reverse diffusion (bottom) starts from pure noise and lets a neural network remove it step by step, conditioned on a text prompt.
The training procedure has two halves:
- Forward. Take a real image. Add a tiny bit of noise. Add a tiny bit more. Repeat many times until the image is pure static.
- Reverse. Train a neural network to undo one step of noise at a time. Given a noisy image and how noisy it is, predict what was added.
Once trained, you can run the reverse half from scratch: start with pure noise, denoise step by step, and a coherent image emerges. With conditioning (chapter 3), you can steer that emergence — most commonly with a text prompt, embedded by a model like CLIP (Radford et al., 2021) and injected at every step.
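To make the two halves concrete, here is a minimal PyTorch sketch of one DDPM-style training step (following Ho et al., 2020). The schedule values are illustrative, and `model` stands in for the real denoising network (a U-Net or transformer):

```python
import torch

# A toy noise schedule: alpha_bar[t] is the fraction of the original
# signal that survives after t noising steps (DDPM-style values).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def forward_noise(x0, t, eps):
    """Jump straight to noise level t: x_t = sqrt(ab)*x0 + sqrt(1-ab)*eps."""
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps

def training_step(model, x0):
    """One denoising training step: predict the noise that was added."""
    t = torch.randint(0, T, (x0.shape[0],))   # random noise level per image
    eps = torch.randn_like(x0)                # the noise we add
    x_t = forward_noise(x0, t, eps)           # the noisy image
    eps_pred = model(x_t, t)                  # the network's guess at eps
    return torch.nn.functional.mse_loss(eps_pred, eps)
```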
A few practical consequences:
- The model can be reused for image-to-image by starting from a partially noisy version of an existing image. Less noise = more faithfulness to the input (a sketch of this mapping follows the list).
- The model can be reused for inpainting by only denoising the masked region.
- The model can be reused for outpainting by treating the canvas extension as a masked region.
- The number of steps trades quality for time. Many modern models can produce good results in 4–8 steps using distilled samplers.
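A hedged sketch of the fidelity ↔ freedom trade-off behind image-to-image: the strength setting decides how deep into the noise schedule the input is pushed, which amounts to deciding how many denoising steps actually run. The function name is illustrative, not any library's API.

```python
def steps_that_run(num_steps: int, strength: float) -> int:
    """strength 0.0 keeps the input untouched; 1.0 is pure text-to-image."""
    return int(num_steps * strength)

for s in (0.3, 0.5, 0.8):
    print(f"strength {s}: {steps_that_run(50, s)} of 50 steps re-imagined")
# strength 0.3: 15 of 50 steps re-imagined (faithful to the input)
# strength 0.8: 40 of 50 steps re-imagined (mostly reinvented)
```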
Latent diffusion (Rombach et al., 2022) adds a key efficiency trick: the diffusion does not happen in pixel space (hundreds of thousands to millions of numbers per image) but in a compressed latent space (tens of thousands). This is why models can run on a consumer laptop.
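The arithmetic behind the efficiency claim, using Stable Diffusion's usual shapes (a 512×512 RGB image compressed 8× per side into a 4-channel latent) as an illustration:

```python
pixel_numbers = 512 * 512 * 3   # 786,432 values per image in pixel space
latent_numbers = 64 * 64 * 4    # 16,384 values per image in latent space
print(pixel_numbers / latent_numbers)
# 48.0, so every diffusion step touches ~48x fewer numbers
```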
The most recent generation of image models (e.g. SD3, FLUX) uses flow matching or rectified flows rather than vanilla diffusion (Esser et al., 2024). The intuition stays the same.
The vocabulary of text-to-image¶
When you use a text-to-image tool you will meet a small set of knobs.
- Prompt — what you want.
- Negative prompt — what you don’t want. Often more powerful than people expect.
- Aspect ratio — 1:1, 16:9, 9:16, 2:3, 3:2. Some compositions look fine in landscape and break in square.
- CFG scale / guidance — how strictly the model should obey the prompt. 5–9 is typical. Higher = more literal but oversaturated; lower = looser and more varied.
- Steps — number of denoising steps. 20–50 is typical for high quality; 4–8 for fast distilled models.
- Seed — the random seed for the noise. Same prompt + same seed = same image (within one model). Useful for iterating with a controlled variable.
- Sampler / scheduler — the algorithm that performs the denoising (Euler, DPM++, etc.). Affects style and convergence.
- Style / reference image — an extra conditioning input.
A simple discipline: change one knob at a time. If you change the prompt and the seed and the CFG, you cannot tell what caused the change in output.
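In code, most of these knobs are just arguments to the generation call. A hedged sketch with the Hugging Face diffusers API; the model identifier is an example (any Stable Diffusion checkpoint works), and "mps" or "cpu" can replace "cuda":

```python
import torch
from diffusers import DPMSolverMultistepScheduler, StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config
)  # the sampler/scheduler knob

generator = torch.Generator("cuda").manual_seed(42)  # the seed knob

img = pipe(
    prompt="a 1972 Volvo Amazon in front of a yellow wooden house",
    negative_prompt="blurry, watermark, low quality",  # what you don't want
    width=768, height=512,                             # aspect ratio (3:2)
    guidance_scale=7.0,                                # CFG scale
    num_inference_steps=30,                            # steps
    generator=generator,                               # same seed = same image
).images[0]
img.save("volvo.png")
```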
Prompting for images¶
Image prompts are not the same as prompts for a language model. A working pattern is:

[Subject] | [composition] | [style] | [medium] | [lighting] | [mood] | [extra refs]

A worked example:
A wooden rowing boat moored at a fjord pier, viewed from a low angle,
black-and-white film photograph, soft early morning light, 50 mm,
nostalgic, in the style of late 20th-century Scandinavian photography

Some practical tips:
- Be concrete about the subject before you reach for style. “A car” is vague; “a 1972 Volvo Amazon parked in front of a yellow wooden house” is something the model can grip.
- Style words matter. Material (“oil painting”, “pen drawing”), medium (“photograph”, “render”), and period (“1970s”, “Renaissance”) all do work.
- Avoid contradictions. “Photo-realistic illustration in watercolour” asks the model for three things a single image cannot be at once.
- Iterate on what is wrong, not on the whole prompt. If the lighting is off, change only the lighting words.
Negative prompts¶
For models that support them, negative prompts are where you stop the failure modes you keep seeing: blurry, extra fingers, watermark, low quality, deformed face. Treat the negative prompt as a curated list, not a brain dump.
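As a sketch, reusing the pipeline set up above: keep the negative prompt as a named, curated constant that you prune as failure modes disappear, rather than a growing dump.

```python
# A curated negative prompt: only failure modes you have actually seen.
NEGATIVE = "blurry, extra fingers, watermark, low quality, deformed face"

img = pipe(
    prompt="portrait of a fisherman at dawn, 50 mm photograph",
    negative_prompt=NEGATIVE,
).images[0]
```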
Reference images and ControlNet¶
If words run out, show. Most modern tools accept:
- a style reference — make it look like this,
- a structural reference — match this composition,
- a pose reference — use this pose,
- a depth or edge map — preserve this geometry.
The open-source community calls these ControlNets and stacks them freely; commercial products call them style reference or character reference.
This is where image generation stops being a slot machine and starts being a controllable creative tool.
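A hedged sketch of a structural reference with a Canny-edge ControlNet in diffusers. The checkpoint names follow the library's documentation and may have moved on the Hub; the edge map is produced with OpenCV.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Extract an edge map from a reference photo (the structural reference).
ref = np.array(Image.open("reference.jpg"))
edges = cv2.Canny(ref, 100, 200)
edges = Image.fromarray(np.stack([edges] * 3, axis=-1))  # to 3 channels

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# The prompt sets the style; the edge map pins the composition.
img = pipe("a watercolour street scene, soft morning light", image=edges).images[0]
```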
Editing instead of generating¶
A common, often more useful workflow is image editing:
- Inpainting: mask a region, replace it with something else. “Replace the bottle in his hand with a coffee cup.”
- Outpainting: extend the canvas. “Add another metre of beach to the left.”
- Variation: same subject, slightly different.
- Upscaling: increase resolution while sharpening detail.
- Removing/adding specific objects with a single click.
You will get further with editing than with pure prompting for most professional jobs.
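A hedged inpainting sketch with diffusers: white pixels in the mask are regenerated, black pixels are kept. The model identifier is one published inpainting checkpoint.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("portrait.jpg").convert("RGB")
mask = Image.open("mask.png").convert("RGB")  # white = replace, black = keep

img = pipe(
    prompt="a coffee cup held in one hand",
    image=image,
    mask_image=mask,
).images[0]
img.save("edited.png")
```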
Where image models still struggle¶
- Hands, feet, text in pictures, jewellery, complicated logos.
- Faithful portraits of specific people — generally restricted by the major commercial tools.
- Multi-step composition. “A man holds a cat with one hand and points at a sign that says ‘Open’ with the other” is at the edge of what current models can compose reliably.
- Symbolic content. Models mimic the look of writing and notation without reproducing it accurately. (Improving — but not solved.)
- Diagrams and infographics. Image models can mimic the look of an infographic but rarely produce accurate data.
- Consistency across a series. Two pictures of “the same character” tend to drift. Solutions: character reference images, LoRA fine-tunes, image-to-image with seed locking.
This week’s lab: Reflect, Explore, Create¶
Reflect (≈ 30 min, in lab + your weekly log)¶
Pick one prompt and write 150–300 words in your weekly log:
- Compare an AI-generated image of “a typical Norwegian street” with a real photograph. What is the model averaging away?
- What changes when image-making moves from “ten minutes for a sketch” to “ten seconds for a finished-looking picture”? Who benefits, who loses?
- Read Hertzmann’s Can Computers Create Art? (Hertzmann, 2018) alongside this chapter and respond to a single one of its claims using a generation you produced this week.
Explore (≈ 60 min, in lab)¶
A controlled experiment.
- Write one base prompt with a clear subject, composition, and style.
- Generate the image four times, each time changing exactly one variable:
- same prompt, same seed, four different aspect ratios;
- same prompt, same seed, four different CFG values (3, 6, 9, 12);
- same prompt, four different seeds;
- prompt unchanged except for the lighting words.
- Lay out the four grids in a single image. Caption each.
- Pick the one knob that mattered most for your subject. Write a short note on why.
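The CFG sweep as a hedged code sketch, reusing a pipeline like the one set up earlier. The trick is to re-create the seeded generator inside the loop so every run starts from identical noise and guidance is the only variable; `BASE_PROMPT` is a placeholder for your own prompt.

```python
import torch

BASE_PROMPT = "..."  # your base prompt from step 1

for cfg in (3, 6, 9, 12):
    # Re-seed each iteration so the starting noise is identical.
    generator = torch.Generator("cuda").manual_seed(42)
    img = pipe(
        prompt=BASE_PROMPT,
        guidance_scale=cfg,       # the one variable that changes
        num_inference_steps=30,
        generator=generator,
    ).images[0]
    img.save(f"cfg_{cfg}.png")
```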
Image-to-image.
- Take a photograph (or screenshot) you have rights to.
- Run it through an image-to-image pipeline with three different denoise strengths (0.3, 0.5, 0.8).
- Note where you sit on the fidelity ↔ freedom axis.
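The image-to-image exercise as a hedged diffusers sketch; `strength` is the denoise strength from the exercise, and the model identifier and file names are examples.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

source = Image.open("my_photo.jpg").convert("RGB").resize((768, 512))

for strength in (0.3, 0.5, 0.8):
    generator = torch.Generator("cuda").manual_seed(42)
    img = pipe(
        prompt="the same scene as a gouache painting",
        image=source,
        strength=strength,        # 0.3 faithful, 0.8 free
        generator=generator,
    ).images[0]
    img.save(f"img2img_{strength}.png")
```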
Optional code track. If you have a Hugging Face account, the diffusers library lets you generate images locally:
```python
import torch
from diffusers import StableDiffusionPipeline

# A distilled model: good results in very few steps.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/sd-turbo",
    torch_dtype=torch.float16,  # use torch.float32 on CPU
).to("cuda")  # or "mps" on Mac, or "cpu"

# Distilled samplers want few steps and little or no guidance.
img = pipe(
    "a wooden rowing boat at sunrise on a fjord, watercolour",
    num_inference_steps=4,
    guidance_scale=1.0,
).images[0]
img.save("boat.png")
```

Create (≈ 30 min, in lab + carry-over to your portfolio)¶
Out of everything you generated above, assemble one finished piece for your portfolio. Choose one form:
- a four-image series with a shared subject and a clear conceptual through-line (e.g., the same Oslo street in four seasons);
- a single poster combining one of your generations with hand-set typography;
- a photo + AI redraw diptych that explicitly puts a real image next to its image-to-image variant.
Write a 100-word honest caption documenting: the tool, the seed (if visible), the prompt, the variations tried, and any human edits. The caption is part of the artefact, not metadata about it.
Going further¶
- Gatys, Ecker, Bethge, A Neural Algorithm of Artistic Style (Gatys et al., 2015) — the 2015 paper that opened “neural style transfer”, a useful pre-diffusion ancestor for image-curious readers.
- Rombach et al., Latent Diffusion Models (Rombach et al., 2022) — the founding paper of Stable Diffusion.
- Esser et al., Scaling Rectified Flow Transformers (Esser et al., 2024) — the SD3 paper.
- The Hugging Face Diffusers documentation (Hugging Face, 2024).
- Lev Manovich, AI Aesthetics (Manovich, 2018) — short essays on what AI image-making looks like.
- Aaron Hertzmann, “Can Computers Create Art?” (Hertzmann, 2018) — re-read this week with image generation in mind.
- For the legal side: Andersen v. Stability AI (United States District Court, Northern District of California, 2023) and Getty Images v. Stability AI (High Court of Justice (UK) and US District Court, District of Delaware, 2023).
- Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2006.11239
- Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://arxiv.org/abs/2112.10752
- Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. International Conference on Machine Learning (ICML). https://arxiv.org/abs/2103.00020
- Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Lacey, K., Goodwin, A., Marek, Y., & Rombach, R. (2024). Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. https://arxiv.org/abs/2403.03206
- Hertzmann, A. (2018). Can Computers Create Art? Arts, 7(2), 18. https://doi.org/10.3390/arts7020018
- Gatys, L. A., Ecker, A. S., & Bethge, M. (2015). A Neural Algorithm of Artistic Style. https://arxiv.org/abs/1508.06576
- Hugging Face. (2024). Diffusers — State-of-the-art Diffusion Models for Image, Video, and Audio Generation. Hugging Face. https://huggingface.co/docs/diffusers/index
- Manovich, L. (2018). AI Aesthetics. Strelka Press. http://manovich.net/index.php/projects/ai-aesthetics
- Andersen v. Stability AI Ltd. (2023). United States District Court, Northern District of California. https://en.wikipedia.org/wiki/Andersen_v._Stability_AI
- Getty Images v. Stability AI (2023). High Court of Justice (UK); US District Court, District of Delaware. https://en.wikipedia.org/wiki/Getty_Images_v._Stability_AI