Why this matters¶
If we are going to use generative AI seriously, we need a mental model of how it works. Not the equations — we can leave those to the specialists — but the cast of characters: data, models, training, inference, parameters, loss, generalisation, bias. With those words in place, the rest of the course makes sense.
By the end of this chapter you should be able to explain, in plain language and over coffee, what happens when somebody says “we trained a model on a billion images and now we are using it to generate logos”.
A toy starting point: learning from examples¶
Imagine you want to teach a computer to tell cats from dogs in photographs. You have two strategies:
- Write rules by hand. “If the ears are pointy and the snout is narrow…” — this is the symbolic AI of the mid-20th century. It works for small problems and breaks for anything visual at the level of the real world.
- Show it many examples of cats and dogs labelled as such, and let it learn the rule itself. This is machine learning (ML).
A simple machine-learning pipeline: data flows into a model with parameters; the model produces a prediction; a loss measures how wrong the prediction is and is used to update the parameters.
The diagram above is essentially the whole field. The specifics — what the data looks like, what shape the model takes, how the loss is measured — change wildly between applications. But the loop is always the same.
The cast of characters¶
Data¶
Data is the fuel. Every AI model you will use in this course was trained on a dataset:
- Images — public photography (e.g. LAION-5B collections), licensed stock libraries, scraped web galleries.
- Text — books, Wikipedia, Common Crawl scrapes of the open web, scientific papers, code repositories.
- Audio — music libraries, podcasts, speech corpora, YouTube transcripts.
- Video — public video platforms, licensed footage libraries, motion-capture archives.
The quality, content, and consent of training data shape the model in deep ways. We will return to this in chapter 11.
Models¶
A model is a function with parameters. In modern AI, this function is a neural network — a layered chain of multiplications, additions, and non-linear “squashing” operations. The parameters are the numbers (often billions of them) that determine exactly which input produces which output.
You do not need to know the inner workings to use a model, just as you do not need to know how a violin is made to play one. But two facts matter:
- The function is differentiable. This means we can compute, for each parameter, which direction would make the prediction slightly better. That is what makes training possible.
- The function is huge. Modern models have between a few hundred million and a few trillion parameters. The model file alone can be tens of gigabytes.
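"A function with parameters" can be made concrete in a few lines. This is a hypothetical toy model, not any real system — just the idea that the same function behaves differently under different parameter values:

```python
import numpy as np

def model(x, params):
    """A model is just a function: same input, different parameters, different output."""
    w, b = params
    return w * x + b

x = np.array([0.0, 0.5, 1.0])
print(model(x, (2.0, 0.5)))   # one set of parameter values...
print(model(x, (-1.0, 3.0)))  # ...another: same function shape, different behaviour
# A large model is this idea with billions of parameters instead of two.
```

Training, described next, is nothing more than the search for parameter values that make this function's outputs match the data.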
Training¶
Training is the process of repeatedly:
- taking a batch of examples from the dataset,
- making predictions with the current parameters,
- measuring the loss (how wrong the predictions were),
- nudging the parameters in the direction that reduces the loss.
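The four steps above can be sketched directly. This is a schematic mini-batch loop on made-up linear data — the dataset, model, and hyperparameters are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=1000)          # the "dataset"
y = 3 * x - 1 + rng.normal(0, 0.1, 1000)   # labels, with a little noise

w, b = 0.0, 0.0                            # parameters, initialised arbitrarily
lr = 0.1

for step in range(500):
    idx = rng.integers(0, 1000, size=32)   # 1. take a batch of examples
    xb, yb = x[idx], y[idx]
    pred = w * xb + b                      # 2. predict with current parameters
    err = pred - yb
    loss = (err ** 2).mean()               # 3. measure the loss
    w -= lr * 2 * (err * xb).mean()        # 4. nudge parameters to reduce the loss
    b -= lr * 2 * err.mean()

print(w, b)  # should land near 3 and -1, the values that generated the data
```

A frontier model runs the same four steps, except the batch holds thousands of documents, the prediction is a forward pass through billions of parameters, and the nudging is spread across hundreds of GPUs.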
This loop runs billions of times for a large model. It typically takes days or weeks on hundreds of high-end GPUs and consumes enormous amounts of electricity. Strubell et al. estimated the carbon cost of training a large NLP model as early as 2019, and the numbers have grown since (Strubell et al., 2019).
The result of training is a trained model: the network plus its specific set of parameter values.
Inference¶
Inference is what happens when you use a trained model. You feed it an input (a prompt, an image, an audio clip), and it produces an output. Inference is far cheaper than training — a few cents instead of a few million dollars — but at the scale of hundreds of millions of users it still adds up.
When you type into ChatGPT, you are running inference on a previously trained model. The model’s parameters do not change.
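The difference between training and inference fits in a few lines. Here the parameter values are made up — pretend they came out of yesterday's training run:

```python
# Hypothetical trained parameters; at inference time they are frozen.
w, b = 1.98, 0.51

def infer(x):
    """Inference: evaluate the trained function. Nothing is learned here."""
    return w * x + b

print(infer(0.3))   # a forward pass — no loss, no gradients, no parameter updates
print(infer(-0.7))  # call it a million times; w and b never change
```

This is why inference is so much cheaper: it is one pass through the function, with none of the machinery for measuring loss and updating parameters.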
Generalisation¶
We want a model that does well on new examples it has never seen. This is generalisation. The opposite is overfitting — the model has memorised the training examples and does poorly on anything else.
Generalisation is the reason a face-recognition model can recognise a face it never saw during training. It is also the reason a language model can write a paragraph about a topic that did not exist in its training set. (And, as we will see, the reason such a paragraph can be subtly wrong.)
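The generalisation-versus-overfitting trade-off can be seen in a toy experiment. Here a flexible polynomial "model" (the degrees and noise level are chosen arbitrarily for illustration) memorises the training points and fails on new ones:

```python
import numpy as np

rng = np.random.default_rng(2)
x_train = rng.uniform(-1, 1, 20)
y_train = 2 * x_train + 0.5 + rng.normal(0, 0.2, 20)
x_test = rng.uniform(-1, 1, 200)                   # new examples, never seen in training
y_test = 2 * x_test + 0.5 + rng.normal(0, 0.2, 200)

for degree in (1, 15):
    coeffs = np.polyfit(x_train, y_train, degree)  # fit a polynomial "model"
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train error {train_err:.3f}, test error {test_err:.3f}")
```

The degree-15 polynomial achieves a lower error on the training points — it has memorised them, noise and all — but a far worse error on the unseen test points. That gap is overfitting; keeping it small is generalisation.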
Bias¶
Models inherit the biases of their data. If a dataset over-represents English-speaking, Western, web-published, well-photographed material, the model will be best at exactly that material. This is not a bug; it is a property of how machine learning works (Bender et al., 2021; Crawford, 2021). We will return to bias as an ethical and practical question in chapter 11, but it is also relevant to everyday creative use.
A very short tour of neural networks¶
A neural network is a stack of layers. Each layer takes a list of numbers, multiplies them by another list of numbers (the parameters), adds, and passes the result through a simple non-linear function. Stacked many times, this becomes flexible enough to model very complex patterns (Goodfellow et al., 2016).
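The multiply-add-squash recipe, stacked twice, looks like this in NumPy. The layer sizes are chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer network: the parameters are the entries of W1, b1, W2, b2.
W1 = rng.normal(0, 0.5, size=(3, 4))   # first layer: 3 inputs -> 4 hidden units
b1 = np.zeros(4)
W2 = rng.normal(0, 0.5, size=(4, 1))   # second layer: 4 hidden -> 1 output
b2 = np.zeros(1)

def network(x):
    """Multiply, add, squash, repeat — that is the whole forward pass."""
    hidden = np.maximum(0, x @ W1 + b1)   # the non-linear "squashing" (here, ReLU)
    return hidden @ W2 + b2

x = np.array([0.2, -1.0, 0.5])
print(network(x))                             # one output value
print(W1.size + b1.size + W2.size + b2.size)  # 21 parameters here; GPT-class models have billions
```

The architectures below differ mainly in how these layers are wired together, not in the basic recipe.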
A few common families you will hear about:
- Convolutional neural networks (CNNs) — used for images for most of the 2010s.
- Recurrent neural networks (RNNs) — used for sequences (text, audio) until around 2018.
- Transformers — the dominant architecture today (Vaswani et al., 2017). Transformers power most chat models, image models, and the new generation of audio and video models.
- Diffusion models — a training procedure layered on top of a neural network, particularly successful for images (Ho et al., 2020). See chapter 5.
The specific architecture matters for performance, but the cast of characters above stays the same.
Foundation models and fine-tuning¶
Most state-of-the-art creative AI tools today are foundation models: very large models trained once, then adapted to many tasks. Adaptation can take several forms:
- Prompting — putting the right text in front of the model at inference time. No retraining needed.
- Fine-tuning — taking a foundation model and continuing to train it briefly on a smaller, more focused dataset. Costs vary from a few dollars (LoRA on a personal GPU) to millions (full fine-tunes of large models).
- RLHF / RLAIF — reinforcement learning from human or AI feedback. Used to align chat models to be helpful and polite.
- Distillation — training a smaller “student” model to mimic a larger “teacher”.
For most of this course, you will be prompting existing foundation models. In chapter 4 we look at what prompting really is.
This week’s lab: Reflect, Explore, Create¶
Reflect (≈ 30 min, in lab + your weekly log)¶
Pick one prompt and write 150–300 words in your weekly log:
- List three categories of work in your discipline where AI models are plausibly useful, and three where you suspect they are not. What is the difference?
- The training data of large models is mostly English, mostly Western, and mostly from the open web. How might that show up in a model’s outputs for your discipline or your native language?
- Strubell et al. estimated training a large NLP model could emit as much CO₂ as five cars over their lifetimes (Strubell et al., 2019). How does that change (or not) how you feel about using these tools?
Explore (≈ 60 min, in lab) — inspect a model card¶
- Visit Hugging Face and search for stable-diffusion (or another model that interests you).
- Pick one model card and read it from top to bottom. Find:
- Which dataset was the model trained on?
- How many parameters does the model have?
- What is the licence?
- What are the known limitations?
- Write down three things from the card that you did not know before, and one thing you did not understand. Bring the “one thing you did not understand” to next week’s lecture.
Create (≈ 30 min, optional code track) — a 10-line training loop¶
This is for students who want to feel a training loop in their fingers. It is not required, but it is the most direct way to make a model rather than just use one — and that is the heart of the Create track. If you do not want to code, instead write a 200-word reverse model card for an imagined Norwegian-language image model: what would you put on the card, and why?
We will train a tiny model to fit a curve. Open a notebook and run:
```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 2 * x + 0.5 + rng.normal(0, 0.1, size=200)

# parameters
w, b = 0.0, 0.0
lr = 0.05

for step in range(200):
    pred = w * x + b
    loss = ((pred - y) ** 2).mean()
    grad_w = ((pred - y) * x).mean() * 2
    grad_b = (pred - y).mean() * 2
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w = {w:.3f}, b = {b:.3f}, loss = {loss:.4f}")
```

You should get w ≈ 2 and b ≈ 0.5 — the values used to generate the data. That is a model with two parameters, trained from scratch, in about a dozen lines. Every "AI" you will use this semester is the same loop scaled up by a factor of ten billion. Commit the notebook (or the reverse model card) to your portfolio.
Going further¶
- Melanie Mitchell, Artificial Intelligence: A Guide for Thinking Humans Mitchell, 2019 — the most accessible book-length introduction to the technical side of the field; ideal companion to this chapter.
- Goodfellow, Bengio, Courville — Deep Learning Goodfellow et al., 2016 — free online textbook.
- 3Blue1Brown — Neural Networks video series Sanderson, 2017 — best visual explanation of backpropagation.
- The Hugging Face course Hugging Face, 2024 — free, code-first, beginner-friendly.
- Crawford, Atlas of AI Crawford, 2021, chapter 1, on what is in the data.
- Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and Policy Considerations for Deep Learning in NLP. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), 3645–3650. 10.18653/v1/P19-1355
- Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT). 10.1145/3442188.3445922
- Crawford, K. (2021). Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence. Yale University Press. https://yalebooks.yale.edu/book/9780300264630/atlas-of-ai/
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. https://www.deeplearningbook.org/
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/1706.03762
- Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2006.11239
- Mitchell, M. (2019). Artificial Intelligence: A Guide for Thinking Humans. Farrar, Straus and Giroux. https://melaniemitchell.me/aibook/
- Sanderson, G. (2017). Neural Networks. 3Blue1Brown. https://www.3blue1brown.com/topics/neural-networks
- Hugging Face. (2024). The Hugging Face Course: Transformers, Diffusers, and LLMs. Hugging Face. https://huggingface.co/learn