
4. AI and language

Writing, dialogue, and large language models

University of Oslo

Why this matters

Of all the generative tools we will meet, large language models (LLMs) are the ones most students already use every day. They draft emails, summarise readings, fix code, explain concepts, brainstorm, translate, and help you cheat on assignments. (We will talk about that last one.)

This chapter is about what is actually happening when you type into ChatGPT, Claude, Gemini, Mistral, or a local model. It is also about how to use them well as a tool for writing and thinking — and how to spot when they are quietly making things up.

What is a language model?

A language model is a system trained to predict the next word (technically: the next token) given the previous words. That is it.

Given the input “The capital of Norway is”, the model assigns a probability to every possible next token. The probability of “Oslo” should be high; the probability of “purple” should be low.

To generate text, the model samples the next token, appends it to the input, and repeats:

The capital of Norway is ☐        → Oslo
The capital of Norway is Oslo ☐   → .
The capital of Norway is Oslo. ☐  → It

This is the same loop whether the model has 1 million parameters or 1 trillion. Modern LLMs are scaled-up versions of GPT-style models from 2018–2020 Brown et al., 2020, built on the transformer architecture Vaswani et al., 2017.
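The loop above can be sketched in a few lines of Python. This is a toy illustration, not a real model: the lookup table stands in for a neural network, and greedy decoding (always take the most probable token) keeps it deterministic, whereas real systems usually sample.

```python
# Toy "language model": a lookup table from context to next-token probabilities.
# The table and its probabilities are invented for illustration only.
TOY_MODEL = {
    "The capital of Norway is": {"Oslo": 0.95, "Bergen": 0.04, "purple": 0.01},
    "The capital of Norway is Oslo": {".": 0.9, ",": 0.1},
    "The capital of Norway is Oslo.": {"It": 0.6, "The": 0.4},
}

def next_token(context: str) -> str:
    """Pick the most probable next token for this context (greedy decoding)."""
    dist = TOY_MODEL[context]
    return max(dist, key=dist.get)

def generate(context: str, steps: int) -> str:
    """The core loop: predict the next token, append it, repeat."""
    for _ in range(steps):
        token = next_token(context)
        # Real tokenizers carry their own spacing information; we join naively.
        context += (" " + token) if token.isalnum() else token
    return context

print(generate("The capital of Norway is", 3))
# → The capital of Norway is Oslo. It
```

Everything a production LLM adds (the transformer, the sampling temperature, the stop conditions) lives inside `next_token`; the outer loop stays exactly this simple.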

Tokens, not words

The model does not see words; it sees tokens. A token is usually a piece of a word — common words become one token, rare words become several. “Oslo” might be one token; “Jensenius” might be three. You can play with this in the OpenAI tokenizer or tiktokenizer.

Why does this matter?

  • Cost. Most commercial models bill per token. Long prompts and long answers cost more.
  • Context length. Each model has a maximum number of tokens it can attend to at once (its context window). Outside that window, information is invisible to it.
  • Languages. Non-English languages often tokenise less efficiently. Norwegian text typically uses more tokens than equivalent English, so it costs more and leaves less room in the context window.
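To make the word/token distinction concrete, here is a toy greedy longest-match tokenizer. The vocabulary is invented for this example; real tokenizers (typically byte-pair encoding) learn their vocabulary of tens of thousands of pieces from data, but the splitting behaviour is analogous.

```python
# Toy subword tokenizer: greedy longest-match against a small, invented vocabulary.
VOCAB = {"Oslo", "Jens", "en", "ius", "is", "open", " "}

def tokenize(text: str) -> list[str]:
    """Split text into the longest matching vocabulary pieces, left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: fall back to a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("Oslo"))       # a common name: one token
print(tokenize("Jensenius"))  # a rarer name: several tokens
```

Run on the examples above, "Oslo" comes out as one token and "Jensenius" as three, which is exactly the pattern that makes rare words and non-English text more expensive.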

Context, not memory

LLMs have no persistent memory between conversations unless a system is built around them to provide it. What they have is a context window: a buffer holding the conversation so far (system instructions + user messages + model responses). Anything outside that buffer simply does not exist for the model.

This is why “remember that we are writing a fantasy novel” works inside a chat (it stays in context) but does not transfer to a new chat (the buffer is gone). Tools that appear to remember (custom GPTs, ChatGPT memory) achieve this by quietly pasting relevant snippets back into the context.
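A rough sketch of what chat frontends do when the conversation outgrows the window: keep the system message, keep the newest turns that fit, and silently drop the rest. The function and its naive word-count "tokenizer" are invented for illustration.

```python
def trim_to_context(messages, max_tokens):
    """Keep the system message plus the most recent messages that fit the
    token budget. Token counting here is naive word-splitting, for
    illustration only; real systems count actual tokens."""
    count = lambda m: len(m["content"].split())
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count(m) for m in system)
    kept = []
    for m in reversed(rest):        # walk from newest to oldest
        if count(m) > budget:
            break                   # this message and everything older is dropped
        kept.append(m)
        budget -= count(m)
    return system + list(reversed(kept))

history = [
    {"role": "system", "content": "You are terse."},
    {"role": "user", "content": "remember that we are writing a fantasy novel"},
    {"role": "assistant", "content": "Noted."},
    {"role": "user", "content": "describe the castle gate"},
]
# With a tight budget, the oldest user turn falls out of context:
print(trim_to_context(history, max_tokens=10))
```

Once the "fantasy novel" turn is trimmed away, the model has no way to know it ever existed; that is the whole meaning of "context, not memory".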

In-context learning, or “prompting”

A striking discovery in 2020 Brown et al., 2020 was that you can teach an LLM new behaviour just by showing it examples in the prompt — no retraining. This is called in-context learning.

Translate to Norwegian.

EN: The library is open.
NO: Biblioteket er åpent.

EN: Where is the train station?
NO: ☐

The model uses the pattern in the prompt to fill in the next answer. This is why prompting is now a real skill: you are programming the model with examples, not with code.

Three useful patterns:

  • Zero-shot — just ask. Works for common tasks the model has seen a lot of.
  • Few-shot — give 2–6 examples of input/output before your real request.
  • Chain-of-thought — ask the model to think step by step before answering. Often improves reasoning, sometimes at the cost of length.
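The few-shot pattern is mechanical enough to automate. A small helper like the following (hypothetical, using the EN/NO labels from the translation example above; the labels are a convention of this example, not something the model requires) assembles instruction, worked examples, and the open query:

```python
def few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: instruction, worked examples,
    then the real query with its answer slot left open."""
    lines = [instruction, ""]
    for source, target in examples:
        lines += [f"EN: {source}", f"NO: {target}", ""]
    lines += [f"EN: {query}", "NO:"]
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Translate to Norwegian.",
    [("The library is open.", "Biblioteket er åpent.")],
    "Where is the train station?",
)
print(prompt)
```

Ending the prompt at the empty `NO:` slot is the trick: the most probable continuation is the pattern's missing answer.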

A practical prompt template

For non-trivial tasks, this skeleton works well:

ROLE: You are a [role with relevant expertise].

TASK: [What you want done, in one sentence.]

CONSTRAINTS:
- [Length, style, format]
- [What to avoid]
- [Audience]

CONTEXT:
[Any background the model needs.]

OUTPUT:
[The exact shape you want, with placeholders or an example.]

This is not magic. It is the same template you would write for a freelancer.
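If you reuse the skeleton often, it is worth templating. This hypothetical helper fills `{{placeholder}}` slots (the same convention the lab below asks you to use); the field values are example text, not prescriptions.

```python
import re

# The skeleton above, with {{placeholders}} for the parts that change per task.
TEMPLATE = """\
ROLE: You are a {{role}}.

TASK: {{task}}

CONSTRAINTS:
- {{constraints}}

CONTEXT:
{{context}}

OUTPUT:
{{output}}"""

def fill(template: str, **values: str) -> str:
    """Replace each {{name}} placeholder; raises KeyError if one is missing,
    so an unfilled slot never slips into a real prompt."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: values[m.group(1)], template)

print(fill(
    TEMPLATE,
    role="science editor",
    task="Tighten the paragraph below without changing its claims.",
    constraints="Max 80 words; plain English; keep all citations.",
    context="(paste the paragraph here)",
    output="The revised paragraph only, no commentary.",
))
```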

The failure modes you need to recognise

LLMs fail in characteristic ways. You will see all of these by week 6.

Hallucination

The model generates a confident, fluent statement that is false. It might invent a paper title, a court case, a quote, or a study. This is not a bug to be patched — it is a direct consequence of the next-token training objective, which rewards plausibility over truth Bender et al., 2021.

Mitigations:

  • Ask for sources and check them. (Beware: the model can also hallucinate sources.)
  • Restrict the task to the model’s strengths (e.g., rewriting, summarising provided text).
  • Provide grounding — paste the article and ask the model to answer from it, not from its weights.
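The grounding mitigation can be wrapped in a helper like this (a common pattern, sketched here with invented wording; it reduces but does not eliminate hallucination):

```python
def grounded_prompt(article: str, question: str) -> str:
    """Wrap a question in the pasted source text so the model answers from
    its context window, not from its weights."""
    return (
        "Answer the question using ONLY the article below. "
        "If the article does not contain the answer, say so.\n\n"
        f"ARTICLE:\n{article}\n\n"
        f"QUESTION: {question}"
    )

print(grounded_prompt(
    "(paste the article text here)",
    "What does the article claim about its main topic?",
))
```

The explicit escape hatch ("say so") matters: without it, a model asked about something the article omits will often fall back on its weights and bluff.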

Sycophancy

The model agrees with you, even when you are wrong. Tell it that 2 + 2 = 5, insist confidently, and it will often capitulate. This is a side effect of training procedures that reward “helpful” answers.

Mitigations:

  • Don’t lead the witness. Ask “is the following correct?” rather than “I think X is correct, right?”
  • Ask for counterarguments explicitly.

Verbosity

The model produces three paragraphs where one sentence is needed. Solution: ask for fewer words. Specify the exact format. Models obey length instructions reasonably well in 2026.

Style drift

A long generation drifts in tone or style. Solution: regenerate from a fresh prompt every few hundred words.

Inability to count, sort, multiply

LLMs are not calculators. They will confidently get 17 × 23 wrong. Modern chat products fix this by giving the model access to a code interpreter. If you are doing anything numeric, make sure the model is running code, not just generating prose.
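The same principle applies in your own scripts: compute the number in code and hand the model the result, rather than asking it to generate digits. A minimal sketch (the prompt wording is invented):

```python
# The fix for arithmetic is to run code, not to generate digits.
# Chat products do this with a built-in code interpreter; in a script,
# compute the result yourself and put it into the prompt.
product = 17 * 23  # computed deterministically, not predicted token by token
prompt = (
    f"The product 17 x 23 equals {product}. "
    "Explain this multiplication to a ten-year-old."
)
print(prompt)
```

The model now only has to do what it is good at (explaining), while the part it is bad at (arithmetic) is done by the interpreter.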

Out-of-date knowledge

Models have a training cutoff. They do not know what happened yesterday unless they have web search. Always check the cutoff if recency matters.

Open vs closed models

You will work with two kinds of LLM this semester:

  • Closed/commercial — OpenAI, Anthropic, Google, Mistral (some). You access them via a website or API. The weights are not public. They tend to be the most capable.
  • Open-weight — Llama, Mistral (some), Gemma, Qwen, DeepSeek, etc. You can download the weights and run them yourself, on a laptop or a server. They lag the frontier by 6–18 months but are closing the gap.

Pragmatic guidance:

  • For rapid, high-quality drafts, use a frontier closed model.
  • For research, reproducibility, sensitive data, or learning, prefer an open-weight model you can run locally (e.g., via Ollama or LM Studio).
  • For production, weigh privacy, cost, latency, and quality.

How to write with an LLM

A working pattern that holds up across disciplines:

  1. Think first. Make a bullet outline yourself. Do not ask the model to brainstorm from scratch — that path leads to bland, average prose.
  2. Use the model to argue with the outline. “What is missing? What is wrong? What audience would object?”
  3. Draft yourself. Write a rough version of each section.
  4. Use the model to edit. “Make this paragraph half as long. Make this sentence clearer. Suggest three alternative openings.”
  5. Verify everything claimed as fact against an original source.
  6. Track your prompts. Keep them in a file alongside your draft.

You should treat the LLM as a fast, slightly drunk colleague — useful, opinionated, sometimes wrong, never to be trusted on anything that matters without a check.

This week’s lab: Reflect, Explore, Create

Reflect (≈ 30 min, in lab + your weekly log)

Pick one prompt and write 150–300 words in your weekly log:

  1. How would you tell, in five seconds, that a paragraph in front of you was written by an LLM? Test your heuristic on three short paragraphs (a mix of your own, a model’s, and a colleague’s) and see how often you are right.
  2. The 2024 EU AI Act European Parliament, Council of the European Union, 2024 introduces transparency requirements for “synthetic content”. What would meaningful labelling look like for an essay drafted with an LLM?
  3. Where, in your own writing process, is the LLM most useful? Where is it actively in the way?

Explore (≈ 45 min, in lab)

Hallucination hunt.

  1. Ask an LLM for five academic references on a niche topic in your field (something obscure enough that it might bluff).
  2. Try to find each reference. How many actually exist? How many are partially real (real authors, wrong title; real title, wrong year)?
  3. Write a short note (200 words) on what you found and on the pattern of the hallucinations — which fields did the model invent, which did it transcribe correctly?

Two-model comparison.

  1. Pick a single writing task from your discipline (one paragraph).
  2. Run the same prompt through two different LLMs (e.g., one commercial like ChatGPT or Claude, one open-weight via Ollama).
  3. Compare the outputs: where do the models differ in factuality, in tone, in length, in confidence?

Optional code track.

Use the openai, anthropic, or ollama Python package to call a model from a notebook. UiO has its own generative AI service for staff and students.

from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        # The system message sets behaviour; the user message is the request.
        {"role": "system", "content": "You answer in one sentence."},
        {"role": "user", "content": "What is the etymology of the word 'fjord'?"},
    ],
)
print(resp.choices[0].message.content)

Create (≈ 45 min, in lab + carry-over to your portfolio)

Build a personal prompt library for your discipline. This is one of the most useful artefacts you can leave the course with.

  1. Pick three writing tasks you actually do in your field (e.g., a paragraph for a project report, an explanation of a concept for non-experts, a critique of a paper, a translation, a summary).
  2. For each task, build a reusable prompt using the ROLE / TASK / CONSTRAINTS / CONTEXT / OUTPUT template above. Save each prompt as a templated form with {{placeholders}} for the parts you would swap in next time.
  3. Test each template with one concrete instance and paste the output beside it.
  4. Commit prompt-library.md to your portfolio. You will reuse and refine this all semester.

Going further

  • Vaswani et al., Attention Is All You Need Vaswani et al., 2017 — the founding paper of the transformer.
  • Stephen Wolfram, What Is ChatGPT Doing... and Why Does It Work? Wolfram, 2023 — best intuitive explanation of LLMs.
  • The Hugging Face course Hugging Face, 2024 — free and code-first; includes the LLM track.
  • Bender et al., On the Dangers of Stochastic Parrots Bender et al., 2021 — the critical take you have to read.
  • The UiO AI service guidelines University of Oslo, 2025.
References
  1. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., & others. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2005.14165
  2. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/1706.03762
  3. Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT). https://doi.org/10.1145/3442188.3445922
  4. European Parliament, & Council of the European Union. (2024). Regulation (EU) 2024/1689 — The AI Act. https://eur-lex.europa.eu/eli/reg/2024/1689/oj
  5. Wolfram, S. (2023). What Is ChatGPT Doing... and Why Does It Work? https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/
  6. Hugging Face. (2024). The Hugging Face Course: Transformers, Diffusers, and LLMs. Hugging Face. https://huggingface.co/learn
  7. University of Oslo. (2025). AI at UiO. https://www.uio.no/english/services/ai/