How does the 'score' concept contribute to image generation?

The 'score' represents the gradient of the log probability density of the data distribution, indicating the direction to move from noise towards clean images. It acts like a compass, guiding the generative process towards valid data points, and can be estimated via denoising score matching.
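As a rough illustration of the compass idea, the sketch below uses a case where the score is known in closed form: a 1D Gaussian. Following the score with a little injected noise (Langevin dynamics, chosen here for illustration and not named in the summary) moves samples from an arbitrary start toward the data distribution. The function names, step size, and toy target are assumptions.

```python
import numpy as np

def gaussian_score(x, mu, sigma):
    """Score of N(mu, sigma^2): d/dx log p(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma**2

def langevin_sample(score_fn, x_init, step_size=0.01, num_steps=2000, seed=0):
    """Repeatedly step along the score and add Gaussian noise."""
    rng = np.random.default_rng(seed)
    x = np.array(x_init, dtype=float)
    for _ in range(num_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step_size * score_fn(x) + np.sqrt(2.0 * step_size) * noise
    return x

# Toy target: N(3, 0.5^2). All particles start at 0, far from the data.
samples = langevin_sample(lambda x: gaussian_score(x, mu=3.0, sigma=0.5),
                          x_init=np.zeros(5000))
print(samples.mean(), samples.std())
```

In a real model the closed-form score is replaced by a neural network trained with denoising score matching.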

Why is flow matching considered the default paradigm for image generation in 2026?

Flow matching frames image generation as a mass transport problem, mapping probability density from an initial to a target distribution via a vector field. It's preferred for its efficiency and ability to generate images with fewer steps, particularly with variants like Rectified Flow that simplify the paths.

What is the role of the Variational Autoencoder (VAE) in image generation?

The VAE learns a compact, structured latent space by compressing images into a lower-dimensional representation. This makes it easier for generative models to learn and operate, though it can sometimes lead to a loss of fidelity in the reconstructed image. However, recent models are exploring pixel-space generation without VAEs.
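A minimal sketch of the trade-off described above, assuming a Gaussian encoder with diagonal covariance, a standard normal prior, and a squared-error reconstruction term. The `beta` weight and the function name are illustrative assumptions, not details from the lecture.

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """Reconstruction error plus KL(N(mu, exp(log_var)) || N(0, I))."""
    recon = np.sum((x - x_recon) ** 2)
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
    return recon + beta * kl, recon, kl

x = np.array([0.2, 0.8, 0.5])
total, recon, kl = vae_loss(x, x_recon=x, mu=np.array([1.0]), log_var=np.array([0.0]))
print(total, recon, kl)
```

The KL term is what pulls the latent space toward the prior; weighting it more heavily gives a better-behaved latent space at the cost of reconstruction fidelity.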

How do image generation models like DiT and UNet incorporate user prompts and conditions?

Architectures like U-Net and Diffusion Transformers (DiT) take a noise level, a condition (like a user prompt), and a noisy latent as input to predict the velocity field. DiT, in particular, injects conditions using an adaptive layer norm framework and can utilize joint attention for multimodal inputs like text and image.
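The adaptive layer norm idea can be sketched as follows: the condition is projected to a per-channel shift and scale that modulate the normalized tokens. This is a simplified sketch under assumed shapes and names (`w_shift`, `w_scale`); real DiT blocks use learned MLPs and additional gating.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token over its channel dimension (no learned affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def adaln_modulate(tokens, cond, w_shift, w_scale):
    """tokens: (N, D); cond: (C,); w_shift, w_scale: (C, D) projections."""
    shift = cond @ w_shift          # (D,), broadcast over all tokens
    scale = cond @ w_scale          # (D,)
    return layer_norm(tokens) * (1.0 + scale) + shift

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 8))     # 4 latent patches, 8 channels
cond = rng.standard_normal(3)            # e.g. noise level + prompt embedding
out = adaln_modulate(tokens, cond,
                     rng.standard_normal((3, 8)), rng.standard_normal((3, 8)))
print(out.shape)
```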

What factors are considered in training diffusion models for realistic image generation?

Training involves sampling noise levels (often from a logit normal distribution to prioritize 'middle' noise levels), considering image resolution's impact on perceived noise, and multi-stage pipelines including pre-training (on large datasets), continued training (for specific domains like 'teddy bears'), and fine-tuning (e.g., DreamBooth with LoRA) to tailor models.
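To make the noise-level sampling concrete, here is a small sketch of drawing noise levels in (0, 1) from a logit-normal distribution by passing Gaussian draws through a sigmoid. The mean of 0 and standard deviation of 1 are assumed defaults for illustration.

```python
import numpy as np

def sample_logit_normal(num_samples, mean=0.0, std=1.0, seed=0):
    """Draw z ~ N(mean, std^2) and squash it to (0, 1) with a sigmoid."""
    rng = np.random.default_rng(seed)
    z = rng.normal(mean, std, size=num_samples)
    return 1.0 / (1.0 + np.exp(-z))

t = sample_logit_normal(100_000)
middle_fraction = np.mean((t > 0.25) & (t < 0.75))
print(middle_fraction)  # a uniform sampler would give about 0.5
```

With these settings roughly 73% of the draws land in the middle half of the range, so training sees mid-noise examples more often than it would under uniform sampling.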

How are generated images evaluated for quality and prompt alignment?

Evaluation is done through human ratings (pairwise comparisons aggregated with the Elo rating system, which accounts for opponent strength) and automated metrics like FID (Fréchet Inception Distance), which measures the distance between real and generated image distributions. Multimodal Large Language Models (MLLMs) can also serve as automated judges.
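The way a rating can account for opponent strength is easiest to see in the standard Elo update, sketched below. The K-factor of 32 and the starting ratings are assumptions; leaderboards may use different constants or a related statistical model.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, a_wins, k=32.0):
    """Return updated (rating_a, rating_b) after one pairwise comparison."""
    expected_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_wins else 0.0
    delta = k * (actual_a - expected_a)
    return rating_a + delta, rating_b - delta

even_match = elo_update(1000.0, 1000.0, a_wins=True)
upset = elo_update(1000.0, 1400.0, a_wins=True)
print(even_match, upset)
```

Beating a much stronger opponent moves the rating more than beating an equal one, which is what a plain win rate cannot express.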

What are the key considerations when extending image generation models to video generation?

Video generation adds a 'time' dimension, requiring temporal consistency between frames and efficient computation due to increased data dimensionality. It often involves 3D VAEs (specifically 'causal VAEs') that compress both spatial and temporal information, and DiT-based architectures that operate on space-time patches to ensure coherence across sequences.

How can diffusion principles be applied to improve Large Language Model (LLM) generation?

Instead of slow auto-regressive generation, diffusion for LLMs starts from a completely noised text sequence (e.g., masked tokens) and progressively denoises it to produce the final output. This can significantly speed up generation (up to 10x for tasks like coding) and is well-suited for 'fill-in-the-middle' tasks.

What are the current challenges in generative AI development?

Challenges include the computational cost of training and inference (addressed partially by distillation and hardware research), data quality issues (e.g., 'model collapse' from training on AI-generated data), and ethical concerns like safety and trustworthiness of generated content. Solutions like C2PA and digital watermarking (e.g., SynthID) are being developed for provenance.

Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 8 - Trending Topics

Stanford Online

Education | 6 min read | 110 min video

Jun 1, 2026|13,959 views|237|8

TL;DR

Flow matching, the most recent of the three paradigms covered, is now the default for image generation because it samples efficiently in few steps. Meanwhile, some top models are dropping VAEs and pre-trained text encoders and scaling to billions of parameters for state-of-the-art results.

Key Insights

State-of-the-art image generation models, like Hydream 01, are scaling to 8 billion and even 200 billion parameters, suggesting that massive scale can make it workable to operate directly in pixel space without a VAE.

Flow matching is the default paradigm for image generation in 2026, offering a more efficient approach than diffusion or score matching, especially with variants like rectified flow enabling faster inference with fewer steps.

Video generation models extend image generation principles by adding a temporal dimension, requiring models to maintain temporal consistency and often employing a 3D VAE for spatio-temporal compression.

Image editing tasks are shifting from 'generation from scratch' to specialized editing actions, potentially driven by Vision-Language Models (VLMs) that can interpret editing intents and translate them into actionable software commands.

Diffusion models are being adapted for Large Language Models (LLMs) to enable non-autoregressive text generation, potentially offering up to 10x speedups for tasks like coding by generating entire sequences from noise at once.

The cost of generating high-quality images is approximately 10 cents per megapixel, with efforts focused on distillation and hardware innovation to reduce this price and improve accessibility.

Core image generation paradigms: diffusion, score matching, and flow matching

The course began by exploring foundational paradigms for image generation. Diffusion models work by progressively corrupting images into noise via a forward process and then learning to reverse this process to generate clean images from noise. This involves estimating the noise added at each step, often using an L2 regression loss. Score matching, the second paradigm, focuses on learning the 'score' of the data distribution, which is the gradient of the log probability density. This score acts like a compass, guiding generation from noise towards the data distribution. It was shown that both diffusion and score matching can be unified under a continuous formulation via stochastic differential equations. The third and most recent paradigm, flow matching, frames generation as transporting probability density from an initial distribution to a target data distribution. It involves learning a vector field that describes the velocity of particles moving between distributions. Flow matching is considered the preferred method in 2026, particularly variants like rectified flow, which offer straighter paths for faster inference with fewer steps. This approach is favored for its practical efficiency and is often used by default for generative tasks.
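The flow matching recipe with straight (rectified) paths can be sketched in a few lines. The sketch assumes the convention that t = 0 is noise and t = 1 is data, and it uses a hand-written constant velocity field in place of a trained network; all names are illustrative.

```python
import numpy as np

def rectified_flow_pair(x0, x1, t):
    """Point on the straight path from noise x0 to data x1, plus its velocity target."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

def flow_matching_loss(v_pred, v_target):
    """L2 regression of the predicted velocity onto the target velocity."""
    return np.mean((v_pred - v_target) ** 2)

def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

x0 = np.array([0.0, 0.0])            # "noise" sample
x1 = np.array([2.0, -1.0])           # "data" sample
x_mid, v_target = rectified_flow_pair(x0, x1, t=0.5)
straight_field = lambda x, t: x1 - x0   # stand-in for a perfectly trained model
one_step = euler_sample(straight_field, x0, num_steps=1)
print(x_mid, one_step)
```

When the learned paths really are straight, a single Euler step already lands on the target, which is the intuition behind sampling with fewer steps.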

Representing images and latent spaces for efficient generation

To manage the high dimensionality and redundancy of pixel data, models employ latent spaces, learned through techniques like autoencoders and variational autoencoders (VAEs). Autoencoders compress images into a lower-dimensional latent space, but controlling the shape and distribution of this space can be challenging. VAEs add a regularization term to the loss function, structuring the latent space to mimic a prior distribution, making it amenable to sampling and generation. This lecture also covered the importance of input representations, including transformer-based encoders like the Vision Transformer (ViT) and multimodal approaches like CLIP, which use contrastive learning. Classifier-free guidance was introduced as a method to condition generation on prompts, significantly improving alignment between input text and output images.
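Classifier-free guidance is commonly written as an extrapolation from the unconditional prediction toward the conditional one. The sketch below uses that form; the exact parameterization of the guidance scale varies between papers, so treat the convention here as an assumption.

```python
import numpy as np

def classifier_free_guidance(pred_uncond, pred_cond, guidance_scale):
    """Scale 0 gives the unconditional prediction, 1 the conditional one,
    and values above 1 push further in the direction of the prompt."""
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)

pred_uncond = np.array([0.0, 1.0])   # model output with the prompt dropped
pred_cond = np.array([1.0, 3.0])     # model output with the prompt
guided = classifier_free_guidance(pred_uncond, pred_cond, guidance_scale=2.0)
print(guided)
```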

Architectures for generation: From U-Nets to Diffusion Transformers

The course delved into popular architectures used for image generation. The U-Net, with its downsampling and upsampling paths and skip connections, was discussed for its utility in capturing both global context and fine details. However, a significant limitation of such architectures is that distant patches cannot interact directly. This led to the rise of Transformer-based models, particularly the Diffusion Transformer (DiT), which emerged in 2022. DiTs leverage self-attention mechanisms to allow all patches within an image to interact, enabling better handling of global coherence and complex spatial relationships. This architecture proved so effective that multimodal DiTs, which incorporate conditions into joint attention, became the dominant trend by 2026. These models, often trained on vast datasets, represent the current state-of-the-art in terms of architectural design for generative vision tasks.
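To see why self-attention lets every patch interact with every other patch, consider this minimal single-head sketch. Random projection matrices stand in for learned weights, and the patch count and width are arbitrary assumptions.

```python
import numpy as np

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over patch tokens of shape (N, D)."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
num_patches, dim = 16, 8
tokens = rng.standard_normal((num_patches, dim))
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) * 0.1 for _ in range(3))
out, weights = self_attention(tokens, w_q, w_k, w_v)
print(weights.shape)
```

The attention matrix has one entry per pair of patches, so a patch in one corner can draw on a patch in the opposite corner within a single layer.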

Training, evaluation, and state-of-the-art models

Training generative models involves several stages, starting with pre-training on large, diverse datasets. This is followed by optional continued training on specific domains or fine-tuning, such as with DreamBooth, which uses rare tokens to teach models to generate particular subjects. Techniques like LoRA (Low-Rank Adaptation) are used for efficient fine-tuning. Distillation methods aim to reduce the number of inference steps, lowering computational cost. Evaluation is crucial, with metrics like Elo rating used for human-preference-based leaderboards and FID (Fréchet Inception Distance) for automated evaluation, measuring the distance between generated and real image distributions. By mid-2026, top models like OpenAI's GPT Image and Google's models dominate closed-source leaderboards, while open-weight models such as Hydra's Hydream 01, Qwen-Image, and Flux 2, based on flow matching and DiT architectures, are also prominent. Notably, some cutting-edge models are achieving remarkable results by scaling to enormous parameter counts (up to 200 billion) and operating directly in pixel space, eschewing VAEs and pre-trained text encoders, suggesting that massive scale can overcome traditional architectural trade-offs.
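The efficiency of LoRA comes from freezing the original weight matrix and training only a low-rank correction. The sketch below shows the forward pass and the parameter count for one linear layer; the rank, scaling factor, and class name are illustrative assumptions.

```python
import numpy as np

class LoRALinear:
    """Linear layer y = x W^T with a trainable low-rank update scale * B A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                              # frozen pre-trained weights
        self.a = rng.standard_normal((rank, in_dim)) * 0.01  # trainable
        self.b = np.zeros((out_dim, rank))                # trainable, starts at zero
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ self.weight.T + self.scale * (x @ self.a.T) @ self.b.T

    def num_trainable(self):
        return self.a.size + self.b.size

rng = np.random.default_rng(1)
base_weight = rng.standard_normal((64, 64))
layer = LoRALinear(base_weight, rank=4)
x = rng.standard_normal((2, 64))
print(layer.num_trainable(), base_weight.size)
```

Because B starts at zero, the adapted layer initially reproduces the pre-trained model exactly, and only 512 of the layer's 4096 weights' worth of parameters are trained in this example.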

Extending generation to video

Video generation is approached as an extension of image generation, introducing the added dimension of time. Key challenges include maintaining temporal consistency, ensuring plausible motion, and managing computational complexity. Models often utilize a 3D VAE for spatio-temporal compression, creating a latent space where features represent both spatial content and temporal progression. Causal VAEs are employed to ensure that encoding at a given frame depends only on past and present frames, not future ones, which is crucial for efficient streaming and temporal coherence. The generation process then occurs within this spatio-temporal latent space, using architectures like DiTs to ensure coherence across both space and time. Metrics like Fréchet Video Distance (FVD) extend image evaluation metrics such as FID to video.
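The causal constraint can be illustrated with a temporal convolution that pads only on the past side, so the output at frame t never sees frames after t. This is a minimal sketch over per-frame feature vectors with an assumed hand-picked kernel, not a full 3D VAE.

```python
import numpy as np

def causal_temporal_conv(frames, kernel):
    """frames: (T, D) per-frame features; kernel: (K,) temporal weights.
    Output at frame t mixes frames t-K+1 .. t only."""
    k = len(kernel)
    past_padding = np.zeros((k - 1, frames.shape[1]))
    padded = np.concatenate([past_padding, frames], axis=0)
    out = np.zeros_like(frames, dtype=float)
    for t in range(frames.shape[0]):
        window = padded[t : t + k]      # last row of the window is frame t
        out[t] = kernel @ window
    return out

frames = np.arange(12, dtype=float).reshape(6, 2)
kernel = np.array([0.25, 0.25, 0.5])
out_original = causal_temporal_conv(frames, kernel)

# Changing future frames (4 and 5) leaves the outputs for frames 0..3 untouched.
edited = frames.copy()
edited[4:] += 100.0
out_edited = causal_temporal_conv(edited, kernel)
print(np.allclose(out_original[:4], out_edited[:4]))
```

This property is what allows frames to be encoded and decoded in a streaming fashion as they arrive.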

Advanced applications: Image editing and diffusion for LLMs

Image editing is evolving from simple text-to-image prompts to more nuanced tasks. Instead of regenerating an entire image, editing aims to perform specific, controllable modifications while preserving the original content. This is increasingly being tackled by Vision-Language Models (VLMs) that can interpret user intent and generate editing actions, akin to commands in graphics software. Research is focusing on collecting data of user edits and their inferred intents to train VLMs for this purpose. Concurrently, diffusion models are being adapted for Large Language Models (LLMs). This involves treating text generation like image generation, starting from noisy text representations and progressively denoising them to produce coherent sequences. This non-autoregressive approach, often using mask tokens for noise, can offer significant speedups (up to 10x) compared to traditional autoregressive LLMs, making it particularly suitable for tasks like coding and fill-in-the-middle generation. However, training these diffusion-based LLMs is more computationally expensive.
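The mask-based denoising loop for text can be sketched with a toy stand-in for the model. Here `toy_denoiser` simply returns the known target with random confidences, which is an assumption made so the loop is runnable; a real diffusion LLM would predict tokens and confidences from context. The `MASK` id and the reveal schedule are also hypothetical.

```python
import numpy as np

MASK = -1  # hypothetical id for the mask ("noise") token

def toy_denoiser(tokens, target):
    """Stand-in for a trained model: a guess and a confidence for every position."""
    rng = np.random.default_rng(int(np.sum(tokens == MASK)))
    return target.copy(), rng.random(len(tokens))

def masked_diffusion_decode(target, num_steps):
    """Start fully masked and reveal the most confident positions at each step."""
    tokens = np.full(len(target), MASK)
    calls = 0
    for step in range(num_steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        guess, confidence = toy_denoiser(tokens, target)
        calls += 1
        n_reveal = int(np.ceil(masked.size / (num_steps - step)))
        best = masked[np.argsort(-confidence[masked])[:n_reveal]]
        tokens[best] = guess[best]
    return tokens, calls

target = np.arange(16)                 # pretend these are 16 token ids
decoded, model_calls = masked_diffusion_decode(target, num_steps=4)
print(decoded, model_calls)
```

Sixteen tokens are produced with four model calls instead of sixteen sequential ones, which is where the potential speedup over autoregressive decoding comes from. Positions can also be pre-filled before decoding, which is why the approach suits fill-in-the-middle tasks.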

Challenges and future directions: Cost, data, and trust

The field faces ongoing challenges, including the high cost of generating high-quality media, with current top models priced around 10 cents per megapixel. Research into hardware acceleration for transformer operations and distillation techniques aims to mitigate these costs. Data quality and trust are paramount, especially with the increasing indistinguishability between real and generated content. Concerns about model collapse and the dilution of true data distributions are being addressed through methods like C2PA for content provenance and watermarking, such as Google DeepMind's SynthID, to identify AI-generated media. Safety considerations are also crucial, with companies implementing policy guards and legal frameworks evolving to manage harmful content generation. Future research directions include improving multimodal reasoning with images, advancing controlled image editing through agents, synthesizing information from multiple modalities, and potentially revolutionizing fields like robotics and medicine. The ultimate goal is to make AI systems more trustworthy, accessible, and capable of handling complex, real-world tasks.
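As a quick worked example of the quoted price, the arithmetic below converts an image resolution into an approximate cost. It assumes a megapixel means one million pixels and that pricing scales linearly with pixel count; both are assumptions for illustration.

```python
def image_cost_usd(width, height, usd_per_megapixel=0.10):
    """Approximate generation cost for one image at a per-megapixel price."""
    megapixels = width * height / 1_000_000
    return megapixels * usd_per_megapixel

cost_1k = image_cost_usd(1024, 1024)
cost_4k = image_cost_usd(3840, 2160)
print(round(cost_1k, 4), round(cost_4k, 4))
```

Under these assumptions a 1024 x 1024 image costs about 10 cents and a 3840 x 2160 image about 83 cents, which shows how quickly costs grow with resolution.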

Common Questions

Diffusion models learn to reverse a forward process that gradually corrupts clean images into Gaussian noise. By learning this reverse process, they can generate clean images starting from noise. The goal is to move from an easy-to-sample distribution (Gaussian) to a complex data distribution (images).

Topics

AI Safety, AI & Machine Learning, Technology & Innovation, Generative AI, Large Language Models, Image Generation, Model Training, Diffusion Models, Transformer Architecture, Computational Efficiency, Image Editing, Video Generation, Latent Space, AI Evaluation Metrics

Mentioned in this video

Concepts

CME 296

A Stanford course on Diffusion & Large Vision Models, concluding with this 8th lecture.

Score-based Generative Models

A class of models that use the gradient of the log probability density (score) to guide sampling from noise to data distribution; discussed as a 'variance exploding' formulation.

Flow matching

A paradigm for image generation that frames the problem as moving probability density from an initial distribution to a target distribution via a vector field. It is highlighted as the default method used by most models in 2026.

Rectified Flow

A variant of flow matching that allows paths to be straighter, reducing the number of steps required for image sampling during inference.

Variational Autoencoder

A model used to learn a structured latent space for images by compressing information into fewer dimensions, making it easier for generative models to learn. It uses an Evidence Lower Bound (ELBO) loss.

Vision Transformer

A transformer-based encoder for image representation, using self-attention mechanisms originally developed for text.

Classifier-Free Guidance

A method to incentivize image generation to be more aligned with conditioning inputs like text prompts.

U-Net

A popular image generation architecture composed of downsampling and upsampling parts with copy-and-crop connections, used for gaining global understanding and reconstructing image shapes.

Diffusion Transformer

An architecture that improved on U-Net by allowing distant image patches to interact directly via self-attention mechanisms, and injecting conditions using an adaptive layer norm framework.

Multimodal Diffusion Transformer

An advanced Diffusion Transformer architecture that considers text conditions as part of a joint attention mechanism for improved image generation.

Logit-Normal Distribution

A sampling distribution for noise levels during model training that focuses more on middle steps, which are crucial for making important generation decisions.

LoRA

A technique (Low-Rank Adaptation) that keeps a large model's original weights frozen and trains only small low-rank matrices added alongside them, making fine-tuning more efficient and memory-friendly.

Distillation

A category of methods aimed at shortening the number of inference steps needed to generate samples from a model, thereby reducing generation time and computational cost.

Progressive Distillation

An example of a distillation method for reducing inference steps in generative models.

Elo rating system

A way of turning pairwise human preference comparisons between generative models into ratings that take the strength of the opponent into account, providing a more robust performance metric than simple win rates.

Fréchet Video Distance

A metric similar to FID, used to evaluate the quality of generated videos by measuring the distance between generated video distributions and real video distributions, using a pre-trained encoder specific for videos.

Causal VAE

A 3D Variational Autoencoder designed for video generation where convolutions are asymmetric, ensuring that feature maps for a given frame only depend on itself and previous frames, allowing for streaming encoding and decoding.

Transformer

An architecture initially designed for natural language processing tasks like translation, but adapted for vision tasks due to its scalability benefits, forming the basis of Diffusion Transformers and other generative models.

Flow-GRPO

An adaptation of the GRPO optimization method from the LLM world to the vision domain, combining flow matching with optimization techniques.

Block Diffusion

An approach to generating text block by block using diffusion, which can be useful for handling variable output lengths in text generation, especially when combined with auto-regressive elements.

Software & Apps

CLIP

A model that combines different modalities (like text and images) in the same space using a contrastive loss.

DreamBooth

A tuning technique for image generation models that enables them to generate images of a specific subject or person multiple times by training with a small set of their images and a rare token.

Multimodal Large Language Models

Models capable of processing both text and image inputs, which can be leveraged to automate image quality evaluation by acting as 'judges'.

GPT Image

An image generation model from OpenAI, ranked among the best performing models on a public leaderboard.

Hydra

The company that developed Hydream 01, an open-weight image generation model described as the highest-ranked open-weight model on a public leaderboard as of late May 2026.

Hydream 01

An open-weight image generation model from Hydra, ranked as the highest-performing open-weight model on a public leaderboard as of late May 2026.

Qwen-Image

An image generation model discussed in Lecture 5, identified as a Multimodal Diffusion Transformer, using a flow matching loss, VAEs, and text embeddings based on Qwen.

Flux

An image generation model from Black Forest Labs, based on rectified flow, a combination of single and double stream Diffusion Transformers, VAEs, and Mistral 3 for text embeddings.

Mistral

A pre-trained text encoder used in the Flux 2 models for generating text embeddings.

Qwen

A text embedding model used in Qwen-Image for generating text representations.

Open-weight model

A newly published, top-ranked open-weight image generation model that utilizes flow matching, a transformer-based architecture, but notably operates directly in pixel space without a VAE or pre-trained text encoder. It achieves impressive results through significant scaling.

Photoshop

A well-known image editing software, mentioned as an interface for vision language models (VLMs) to execute inferred editing actions.

BERT

An encoder-only architecture in the text world that uses a similar pre-training task involving masking tokens, but with a fixed masking scheme rather than a variable noise level.

Inception

A company founded by a Stanford professor that is actively working on diffusion-based models for text generation.

SynthID

A watermarking technology from Google DeepMind that hides patterns within the pixels of AI-generated images to reveal their origin, overcoming the limitation of metadata loss in screenshots.

AI Assistant Coding Software

A type of software that assists with coding, mentioned as a tool for guiding through code flows from GitHub repositories to learn about AI methods.

Companies

OpenAI

A leading AI research lab, mentioned for having the top-ranked models, GPT Image and another unnamed model, on a public leaderboard.

Google

A major technology company, mentioned for having two top-ranked AI models on a public leaderboard.

xAI

An AI company mentioned for having a top-ranked model on a public leaderboard.

Black Forest Labs

The lab that developed the Flux 2 models, which are based on rectified flow and a combination of single and double stream diffusion transformers.

GitHub

A platform for hosting code, recommended for cloning repositories of research papers to understand how methods work.

Twitter

A social media platform highlighted for its growing community discussing AI topics and sharing interesting recommendations.

Organizations

C2PA

A content provenance standard gaining traction among software companies that attaches history metadata to images, including AI-generated ones, to reveal their origin. It was presented as a way to preserve trust and help counter 'model collapse'.

Google DeepMind

A leading AI research lab, developers of SynthID watermarking technology.

Stanford University

The institution hosting the CME 296 course and other vision-related courses like CS231N.

Books

DeepSeek-OCR

A research paper published a few months ago that explores considering text as images of text and using an OCR mechanism to process it, suggesting it as a promising direction for text understanding.
