
7. AI and video

Text-to-video, image-to-video, and the time problem

University of Oslo

Why this matters

Image, text, and audio are all easier than video. A video is just a sequence of frames, but generating a sequence of frames that are consistent over time — same character, same lighting, same physics — is dramatically harder than generating any single frame.

By 2024 the first credible text-to-video systems arrived (Sora, Veo, Runway Gen-3, Kling, Pika, Luma). By 2026 they are good enough to do short cinematic clips, ad inserts, music videos, and B-roll for documentary work. They are still not good enough for sustained narrative film. This chapter looks at what is possible now and what is just over the horizon.

Why video is hard

Three problems compound:

  1. Cost. A 5-second clip at 30 fps is 150 frames. Even at compressed latent resolution, that is roughly 100× the inference cost of a single image (see the back-of-envelope sketch after this list).
  2. Consistency. Each frame must be coherent with the previous one — same character, same scene, same lighting. This requires the model to “remember” what it just drew.
  3. Physics. Cloth has to fall, water has to flow, hands have to grip. Image models can fake plausible physics in one frame; video models have to fake trajectories.
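
A back-of-envelope version of the cost point, in Python. The per-frame cost ratio is a loose assumption for illustration only; real systems differ widely.

# Why a short clip costs orders of magnitude more than a single image.
# The 0.7 factor is an assumed relative cost of one compressed latent frame
# versus one standalone image generation -- illustration only.
duration_s = 5
fps = 30
frames = duration_s * fps                      # 150 frames for a 5-second clip
cost_per_frame_vs_image = 0.7                  # loose assumption
relative_cost = frames * cost_per_frame_vs_image
print(f"{frames} frames ≈ {relative_cost:.0f}× the cost of a single image")
# -> 150 frames ≈ 105× the cost of a single image
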
[Figure: a stack of video frames along a time axis, with arrows marking the temporal-consistency requirements between frames.]

The hard part of video is not the frames; it is the time axis between them.

Modern video models use one of three strategies:

  • Frame-by-frame with attention across frames. Each frame “sees” the others through cross-attention.
  • 3D diffusion (space × space × time). Treat the whole clip as a single tensor; diffuse all at once. Conceptually clean, computationally expensive.
  • Two-stage: first generate keyframes, then interpolate the frames in between using a separate, lighter model.

By 2026 the best public systems use combinations of all three.
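
To make the first strategy concrete, here is a minimal, illustrative PyTorch sketch of attention applied along the time axis of a video latent. It is not the architecture of any named system; the clip size, channel count, and layer choices are invented for the example.

import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Illustrative only: every latent pixel attends to the same spatial
    position in all the other frames of the clip."""
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels, height, width) -- one whole clip as a tensor
        b, t, c, h, w = x.shape
        # Fold spatial positions into the batch so attention runs over time only
        seq = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        out, _ = self.attn(seq, seq, seq)        # each frame "sees" the others
        return out.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)

# A 5-second clip at 8 fps in a small latent space: 40 frames, 32x32, 64 channels
clip = torch.randn(1, 40, 64, 32, 32)
print(TemporalAttention(64)(clip).shape)         # torch.Size([1, 40, 64, 32, 32])

The design point is in the reshape: spatial positions are folded into the batch dimension, so the attention runs purely over time, which is the mechanism behind each frame "seeing" the others.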

What current video models can do

In 2026, off-the-shelf tools (Sora, Runway, Veo, Kling, Pika, Luma) reliably produce:

  • 5–15 second clips at 720p or 1080p.
  • Photorealistic or stylised output based on prompts.
  • Camera motion control — orbit, dolly, push, pull, hand-held.
  • Image-to-video — take a still and animate it.
  • Keyframe-to-keyframe — first frame + last frame → in-between motion.
  • Lip-sync — drive a static face image with an audio file.
  • Style transfer across an existing clip.

They struggle with:

  • Longer clips. Coherence falls off rapidly past 10–20 seconds.
  • Specific people and IPs. Most commercial tools refuse explicit requests for real people, brands, or copyrighted characters.
  • Crowds and complex interactions. Two people having a conversation is at the edge of reliable.
  • Hands, text, and small details. Same as image models, but in motion.
  • Legible printed text on signs and screens within a clip.

The vocabulary of video prompting

Video prompts are richer than image prompts because they include motion. A working template:

[Subject + key features],
[scene / setting],
[camera angle and movement],
[lighting and time of day],
[style],
[motion within the scene]

Worked example:

A young woman walking quickly along the riverbank in Oslo,
autumn leaves on the ground, river to the left,
medium-wide shot from a hand-held camera following her from behind,
overcast late-afternoon light,
documentary style,
she pulls a beanie out of her pocket and puts it on as she walks
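
If you generate many clips, it can help to fill the template programmatically so that no field is forgotten. A small sketch in plain Python (no particular video tool assumed) that reproduces the worked example above:

def video_prompt(subject, scene, camera, lighting, style, motion):
    """Join the six template fields into one comma-separated video prompt."""
    return ",\n".join([subject, scene, camera, lighting, style, motion])

print(video_prompt(
    subject="A young woman walking quickly along the riverbank in Oslo",
    scene="autumn leaves on the ground, river to the left",
    camera="medium-wide shot from a hand-held camera following her from behind",
    lighting="overcast late-afternoon light",
    style="documentary style",
    motion="she pulls a beanie out of her pocket and puts it on as she walks",
))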

Useful concepts:

  • Camera moves: dolly in, truck left, crane up, orbit, push in, static.
  • Motion: do not assume the model will turn “looking sad” into “starts to cry” on its own. Spell out the motion, step by step, in time.
  • Cuts: most current systems generate a single shot. For multi-shot sequences you stitch in a video editor.
  • Aspect ratio: 16:9 for landscape work; 9:16 for vertical (TikTok / Reels / Shorts).

Workflow: where video AI actually fits

A realistic production workflow in 2026:

  1. Storyboard in image-AI (chapter 5) until you have a frame you like for each shot.
  2. Animate each frame with image-to-video, possibly with end-frame conditioning.
  3. Stitch the shots in a video editor (DaVinci Resolve is free; CapCut, Premiere, Final Cut work too).
  4. Add audio — voice (chapter 6), music (chapter 6), Foley, ambience.
  5. Colour-grade and finish in the same editor.

The pieces of this pipeline that are most transformed by AI are the first and the second. The traditional editorial work (pacing, sound design, colour) remains very human.
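
One low-tech way to keep this pipeline organised is to write the storyboard down as data before opening any tool. A minimal sketch; the file names and fields are hypothetical, not any tool's format:

# Storyboard as plain data: one entry per shot, carried through steps 1-4.
shot_list = [
    {
        "shot": 1,
        "storyboard_still": "shot01_still.png",    # step 1: image-AI frame
        "motion_prompt": "slow push in as the subject looks up",
        "duration_s": 4,
        "audio": ["ambience_harbour.wav"],          # step 4
    },
    {
        "shot": 2,
        "storyboard_still": "shot02_still.png",
        "motion_prompt": "hand-held follow from behind, walking pace",
        "duration_s": 5,
        "audio": ["footsteps_foley.wav", "ambience_harbour.wav"],
    },
]

total = sum(s["duration_s"] for s in shot_list)
print(f"{len(shot_list)} shots, {total} s before editing")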

Lip-sync and avatars

A specific genre of video AI worth flagging: talking heads. Tools like HeyGen, Synthesia, D-ID, and many others let you upload (or pick from a stock library) a photograph or short clip of a person and drive it with a TTS voice. The output is a video of “that” person reading any text.

This is widely used for:

  • corporate training videos,
  • product explainers,
  • multilingual versions of recorded talks,
  • localised marketing.

It is also a perfect tool for political and personal manipulation. As with voice cloning (chapter 6): explicit consent and clear labelling are non-negotiable.

A short word on cost and access

Video generation is still the most computationally expensive medium per second of output. As of 2026:

  • A 5-second 720p clip uses roughly the same compute as 100 image generations.
  • Most consumer tools price video in credits; one clip costs 5–20 credits depending on length and quality.
  • Free tiers are very limited; the cheaper paid plans (around USD/EUR 10–20 per month) tend to be the right starting point for a course.

If your laptop has 16 GB of RAM and no dedicated GPU, do video work in a hosted tool. Local video generation is for workstations with serious GPUs.

This week’s lab: Reflect, Explore, Create

Reflect (≈ 30 min, in lab + your weekly log)

Pick one prompt and write 150–300 words in your weekly log:

  1. What kinds of moving-image work are easiest for current video AI? Which kinds remain stubbornly out of reach?
  2. Watch one AI-generated short film online (search for “AI short film 2026”). Pause every two seconds. Where does coherence break down? Where does it hold?
  3. Imagine a journalist using AI video for a news report. List three legitimate uses and three uses that would constitute serious misuse. What is the difference?

Pair critique. In the last 20 minutes of class, pair up with another student. Watch each other’s 30-second piece (see Create below) silently, then write:

  • One thing the AI clearly produced well.
  • One thing that gives it away as AI.
  • One thing that would be the next step.

Explore (≈ 30 min, in lab) — a 10-second clip from a single image

  1. Generate a strong still image (chapter 5) of a clear subject in a clear scene.
  2. In a video tool of your choice, run image-to-video with a short motion prompt — e.g., “the subject turns their head slowly to the right while the camera pushes in”.
  3. Generate three variations. Pick the best.
  4. Optional: generate an end-frame and re-run with both first and last frame conditioned.
  5. Note where coherence holds (within a single shot) and where it breaks (motion of multiple objects, hands, fast camera moves).

Create (≈ 60 min, in lab + carry-over to your portfolio) — a 30-second mini-piece

  1. Storyboard three shots of a small idea — for example, “a tourist arriving in Oslo at sunrise”.
  2. Generate each shot (3–5 seconds each).
  3. Add an ambient soundtrack from chapter 6.
  4. Edit the three shots together in a video editor (a minimal MoviePy sketch follows this list).
  5. Add a clearly visible provenance card at the end of the piece (one frame, white text on black): tools, models, year. This is good practice and may soon be required under the EU AI Act for deepfake-adjacent material.
  6. Watch the result twice. Write down what works, what does not, and what you would re-do. Commit the piece + provenance card to your portfolio.
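
For steps 3 to 5 of the Create exercise, here is a minimal stitching sketch. It assumes MoviePy 1.x (the moviepy.editor module; MoviePy 2.x renames several of these calls) and hypothetical file names; the provenance card is a still image exported from any image tool, and the ambience track is assumed to be at least as long as the film.

# Minimal stitch: three AI-generated shots + provenance card + ambient audio.
from moviepy.editor import (AudioFileClip, ImageClip, VideoFileClip,
                            concatenate_videoclips)

shots = [VideoFileClip(f"shot_{i}.mp4") for i in (1, 2, 3)]      # steps 1-2
card = ImageClip("provenance_card.png", duration=3)              # step 5: tools, models, year

film = concatenate_videoclips(shots + [card], method="compose")

music = AudioFileClip("ambience.mp3").subclip(0, film.duration)  # step 3
film = film.set_audio(music)

film.write_videofile("mini_piece.mp4", fps=25, codec="libx264", audio_codec="aac")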

Going further

References
  1. Runway. (2024). Runway Research. Runway. https://runwayml.com/research
  2. OpenAI. (2024). Sora — Creating Video from Text. OpenAI. https://openai.com/sora
  3. Maslej, N., Fattorini, L., Perrault, R., Parli, V., Reuel, A., Brynjolfsson, E., Etchemendy, J., Ligett, K., Lyons, T., Manyika, J., Niebles, J. C., Shoham, Y., Wald, R., & Clark, J. (2024). The AI Index Report. Stanford Institute for Human-Centered Artificial Intelligence. https://aiindex.stanford.edu/report/
  4. Regulation (EU) 2024/1689 — The AI Act. (2024). European Parliament. https://eur-lex.europa.eu/eli/reg/2024/1689/oj
  5. Regulatory Framework on AI — Official Summary. (2024). European Commission, Directorate-General for Communications Networks, Content and Technology. https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai