Key Moments
Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 8 - Trending Topics
Flow Matching, a recent paradigm, is now the default for image generation due to its efficiency, but top models are ditching VAEs and text encoders, scaling to billions of parameters for state-of-the-art results.
Key Insights
State-of-the-art image generation models, such as Hydream 01, are scaling to 8 billion and even 200 billion parameters, suggesting that massive scale can compensate for design choices such as operating directly in pixel space without a VAE.
Flow matching is the default paradigm for image generation in 2026, offering a more efficient approach than diffusion or score matching, especially with variants like rectified flow enabling faster inference with fewer steps.
Video generation models extend image generation principles by adding a temporal dimension, requiring models to maintain temporal consistency and often employing a 3D VAE for spatio-temporal compression.
Image editing tasks are shifting from 'generation from scratch' to specialized editing actions, potentially driven by Vision-Language Models (VLMs) that can interpret editing intents and translate them into actionable software commands.
Diffusion models are being adapted for Large Language Models (LLMs) to enable non-autoregressive text generation, potentially offering up to 10x speedups for tasks like coding by generating entire sequences from noise at once.
The cost of generating high-quality images is approximately 10 cents per megapixel, with efforts focused on distillation and hardware innovation to reduce this price and improve accessibility.
Core image generation paradigms: diffusion, score matching, and flow matching
The course began by exploring foundational paradigms for image generation. Diffusion models work by progressively corrupting images into noise via a forward process and then learning to reverse this process to generate clean images from noise. This involves estimating the noise added at each step, often using an L2 regression loss. Score matching, the second paradigm, focuses on learning the 'score' of the data distribution, which is the gradient of the log probability density. This score acts like a compass, guiding generation from noise towards the data distribution. It was shown that both diffusion and score matching can be unified under a continuous formulation via stochastic differential equations. The third and most recent paradigm, flow matching, frames generation as transporting probability density from an initial distribution to a target data distribution. It involves learning a vector field that describes the velocity of particles moving between distributions. Flow matching is considered the preferred method in 2026, particularly variants like rectified flow, which offer straighter paths for faster inference with fewer steps. This approach is favored for its practical efficiency and is often used by default for generative tasks.
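The flow matching idea above can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: it assumes the straight-line (rectified-flow style) path between a noise sample and a data sample, and the names `interpolate`, `target_velocity`, `flow_matching_loss`, `euler_sample` and the `predict_velocity` callable are hypothetical.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))  # broadcast t over feature dims
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Velocity along the straight path; it does not depend on t."""
    return x1 - x0

def flow_matching_loss(predict_velocity, x0, x1, t):
    """L2 regression of the predicted velocity onto the path velocity."""
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    return float(np.mean((v_pred - target_velocity(x0, x1)) ** 2))

def euler_sample(predict_velocity, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = np.full(x.shape[0], i * dt)
        x = x + dt * predict_velocity(x, t)
    return x
```

The straighter the learned paths, the fewer Euler steps `euler_sample` needs for a given accuracy, which is the intuition behind the "fewer steps" claim for rectified flow.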
Representing images and latent spaces for efficient generation
To manage the high dimensionality and redundancy of pixel data, models employ latent spaces, learned through techniques like autoencoders and variational autoencoders (VAEs). Autoencoders compress images into a lower-dimensional latent space, but controlling the shape and distribution of this space can be challenging. VAEs add a regularization term to the loss function, structuring the latent space to mimic a prior distribution, making it amenable to sampling and generation. This lecture also covered the importance of input representations, including transformer-based encoders like the Vision Transformer (ViT) and multimodal approaches like CLIP, which use contrastive learning. Classifier-free guidance was introduced as a method to condition generation on prompts, significantly improving alignment between input text and output images.
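Two of the ideas above reduce to short formulas. The sketch below assumes the commonly used forms (a diagonal-Gaussian encoder with a standard normal prior for the VAE regularizer, and linear extrapolation between unconditional and conditional predictions for classifier-free guidance); the summary does not spell out the exact variants used in the course, and the function names are hypothetical.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, I) ), summed over latent dimensions.

    This is the regularization term that pulls the latent space toward the prior.
    """
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var, axis=-1)

def classifier_free_guidance(pred_uncond, pred_cond, guidance_scale):
    """Push the prediction away from the unconditional one, toward the prompt.

    guidance_scale = 0 gives the unconditional prediction, 1 gives the
    conditional prediction, and values above 1 extrapolate past it.
    """
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)
```

In practice the two predictions come from the same network, run once with the prompt and once with an empty (dropped) condition.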
Architectures for generation: From U-Nets to Diffusion Transformers
The course delved into popular architectures used for image generation. The U-Net, with its downsampling and upsampling paths and skip connections, was discussed for its utility in capturing both global context and fine details. However, a significant limitation of such architectures is that distant patches cannot interact directly. This led to the rise of Transformer-based models, particularly the Diffusion Transformer (DiT), which emerged in 2022. DiTs leverage self-attention mechanisms to allow all patches within an image to interact, enabling better handling of global coherence and complex spatial relationships. This architecture proved so effective that multimodal DiTs, which incorporate conditions into joint attention, became the dominant trend by 2026. These models, often trained on vast datasets, represent the current state-of-the-art in terms of architectural design for generative vision tasks.
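To make the "all patches interact" point concrete, here is a minimal single-head self-attention sketch over image patches. It is an illustration under simplifying assumptions (one head, no positional embeddings, no conditioning, random projection matrices), and `patchify` and `self_attention` are hypothetical helper names rather than any library's API.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    p = patch_size
    grid = image.reshape(h // p, p, w // p, p, c)
    # Reorder to (patch_row, patch_col, row_in_patch, col_in_patch, channel).
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head attention; returns outputs and the attention weights."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Every row of `weights` is strictly positive, so each patch receives information from every other patch in a single layer, however far apart they are in the image.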
Training, evaluation, and state-of-the-art models
Training generative models involves several stages, starting with pre-training on large, diverse datasets. This is followed by optional continued training on specific domains or fine-tuning, such as with DreamBooth, which uses rare tokens to teach models to generate particular subjects. Techniques like LoRA (Low-Rank Adaptation) are used for efficient fine-tuning. Distillation methods aim to reduce the number of inference steps, lowering computational cost. Evaluation is crucial, with metrics like Elo rating used for human-preference-based leaderboards and FID (Fréchet Inception Distance) for automated evaluation, measuring the distance between generated and real image distributions. By mid-2026, closed models such as OpenAI's GPT Image and Google's models dominate the leaderboards, while open-weight models such as Hydra's Hydream 01, Qwen-Image, and Flux 2, based on flow matching and DiT architectures, are also prominent. Notably, some cutting-edge models are achieving remarkable results by scaling to enormous parameter counts (up to 200 billion) and operating directly in pixel space, eschewing VAEs and pre-trained text encoders, suggesting that massive scale can overcome traditional architectural trade-offs.
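The Elo idea mentioned above can be sketched as follows. This is an assumption-laden illustration: it uses the classic chess constants (a 400-point scale and K = 32), which the summary does not state, and real leaderboards may fit ratings differently. The function names are hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Update both ratings after one human comparison.

    score_a is 1.0 if A's image was preferred, 0.0 if B's was, 0.5 for a tie.
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b
```

Because the update is scaled by how surprising the result was, beating a much stronger opponent moves a rating more than beating a weaker one, which a plain win rate cannot capture.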
Extending generation to video
Video generation is approached as an extension of image generation, introducing the added dimension of time. Key challenges include maintaining temporal consistency, ensuring plausible motion, and managing computational complexity. Models often utilize a 3D VAE for spatio-temporal compression, creating a latent space where features represent both spatial content and temporal progression. Causal VAEs are employed to ensure that encoding at a given frame depends only on past and present frames, not future ones, which is crucial for efficient streaming and temporal coherence. The generation process then occurs within this spatio-temporal latent space, using architectures like DiTs to ensure coherence across both space and time. Metrics like Fréchet Video Distance (FVD) extend image evaluation metrics such as FID to video.
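The causality constraint can be illustrated with a convolution along the time axis that pads only on the past side. This is a minimal sketch of the principle, not the layer used by any particular video VAE; `causal_temporal_conv` and its arguments are hypothetical.

```python
import numpy as np

def causal_temporal_conv(frames, kernel):
    """Convolve along time so output t depends only on frames <= t.

    frames: array of shape (T, ...), one entry per frame.
    kernel: 1D weights of length K; kernel[-1] multiplies the current frame,
            kernel[0] the frame K-1 steps in the past.
    """
    k = len(kernel)
    # Asymmetric padding: K-1 zero frames before the clip, none after it.
    pad = np.zeros((k - 1,) + frames.shape[1:], dtype=frames.dtype)
    padded = np.concatenate([pad, frames], axis=0)
    out = np.zeros_like(frames, dtype=float)
    for t in range(frames.shape[0]):
        window = padded[t : t + k]  # frames t-K+1 .. t (zeros before the start)
        out[t] = np.tensordot(kernel, window, axes=(0, 0))
    return out
```

Since no output reads a later frame, frames can be encoded as they arrive, which is what makes streaming encoding and decoding possible.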
Advanced applications: Image editing and diffusion for LLMs
Image editing is evolving from simple text-to-image prompts to more nuanced tasks. Instead of regenerating an entire image, editing aims to perform specific, controllable modifications while preserving the original content. This is increasingly being tackled by Vision-Language Models (VLMs) that can interpret user intent and generate editing actions, akin to commands in graphics software. Research is focusing on collecting data of user edits and their inferred intents to train VLMs for this purpose. Concurrently, diffusion models are being adapted for Large Language Models (LLMs). This involves treating text generation like image generation, starting from noisy text representations and progressively denoising them to produce coherent sequences. This non-autoregressive approach, often using mask tokens for noise, can offer significant speedups (up to 10x) compared to traditional autoregressive LLMs, making it particularly suitable for tasks like coding and fill-in-the-middle generation. However, training these diffusion-based LLMs is more computationally expensive.
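The mask-based denoising loop for text can be sketched as below. It is a toy illustration under stated assumptions: the sequence starts fully masked, a hypothetical `predict` callable returns per-position token probabilities, and the most confident masked positions are revealed at each step. The summary does not say which unmasking schedule real diffusion LLMs use, and `MASK`, `unmask_step`, and `generate` are made-up names.

```python
import numpy as np

MASK = -1  # hypothetical id standing in for the mask ("noise") token

def unmask_step(tokens, predict, num_to_reveal):
    """Fill in the num_to_reveal masked positions the model is most sure about."""
    probs = predict(tokens)                    # shape (length, vocab_size)
    masked = np.flatnonzero(tokens == MASK)
    confidence = probs[masked].max(axis=-1)
    chosen = masked[np.argsort(-confidence)[:num_to_reveal]]
    out = tokens.copy()
    out[chosen] = probs[chosen].argmax(axis=-1)
    return out

def generate(length, predict, num_steps):
    """Start from an all-mask sequence and denoise it in num_steps passes."""
    tokens = np.full(length, MASK)
    per_step = -(-length // num_steps)         # ceiling division
    for _ in range(num_steps):
        if not (tokens == MASK).any():
            break
        tokens = unmask_step(tokens, predict, per_step)
    return tokens
```

Each pass fills several positions in parallel, so the number of model calls is `num_steps` rather than `length`; that gap is where the speedup over token-by-token autoregressive decoding would come from.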
Challenges and future directions: Cost, data, and trust
The field faces ongoing challenges, including the high cost of generating high-quality media, with current top models priced around 10 cents per megapixel. Research into hardware acceleration for transformer operations and distillation techniques aims to mitigate these costs. Data quality and trust are paramount, especially with the increasing indistinguishability between real and generated content. Concerns about model collapse and the dilution of true data distributions are being addressed through methods like C2PA for content provenance and watermarking, such as Google DeepMind's SynthID, to identify AI-generated media. Safety considerations are also crucial, with companies implementing policy guards and legal frameworks evolving to manage harmful content generation. Future research directions include improving multimodal reasoning with images, advancing controlled image editing through agents, synthesizing information from multiple modalities, and potentially revolutionizing fields like robotics and medicine. The ultimate goal is to make AI systems more trustworthy, accessible, and capable of handling complex, real-world tasks.
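As a rough worked example of the pricing figure quoted above, the sketch below converts an image resolution into an approximate cost. It assumes a flat rate of 10 cents per megapixel, with one megapixel taken as 1,000,000 pixels; actual pricing schemes vary by provider, and `generation_cost_usd` is a hypothetical helper.

```python
def generation_cost_usd(width, height, usd_per_megapixel=0.10):
    """Approximate cost of one generated image at a flat per-megapixel rate."""
    megapixels = width * height / 1_000_000
    return megapixels * usd_per_megapixel
```

Under these assumptions a 1024 x 1024 image costs a little over 10 cents, and a 2048 x 2048 image costs four times as much, since cost grows with pixel count rather than side length.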
Common Questions
Diffusion models learn to reverse a forward process that gradually corrupts clean images into Gaussian noise. By learning this reverse process, they can generate clean images starting from noise. The goal is to move from an easy-to-sample distribution (Gaussian) to a complex data distribution (images).
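The forward corruption and the noise-prediction objective can be sketched as follows. This is a minimal illustration that assumes a variance-preserving formulation with a linear beta schedule, a common choice that the summary does not confirm; the schedule values, function names, and the `predict_noise` callable are all hypothetical.

```python
import numpy as np

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction kept after each forward step."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, step, alpha_bar, rng):
    """Jump directly from a clean sample x0 to its noised version at `step`."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[step]) * x0 + np.sqrt(1.0 - alpha_bar[step]) * eps
    return x_t, eps

def noise_prediction_loss(predict_noise, x0, step, alpha_bar, rng):
    """L2 regression of the predicted noise onto the noise actually added."""
    x_t, eps = add_noise(x0, step, alpha_bar, rng)
    return float(np.mean((predict_noise(x_t, step) - eps) ** 2))
```

By the final step almost none of the original signal remains, so the noised sample is close to pure Gaussian noise, which is the easy-to-sample starting point for generation.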
Mentioned in this video
A Stanford course on Diffusion & Large Vision Models, concluding with this 8th lecture.
A class of models that use the gradient of the log probability density (score) to guide sampling from noise to data distribution; discussed as a 'variance exploding' formulation.
A paradigm for image generation that frames the problem as moving probability density from an initial distribution to a target distribution via a vector field. It is highlighted as the default method used by most models in 2026.
A variant of flow matching that allows paths to be straighter, reducing the number of steps required for image sampling during inference.
A model used to learn a structured latent space for images by compressing information into fewer dimensions, making it easier for generative models to learn. It uses an Evidence Lower Bound (ELBO) loss.
A transformer-based encoder for image representation, using self-attention mechanisms originally developed for text.
A method to incentivize image generation to be more aligned with conditioning inputs like text prompts.
A popular image generation architecture composed of downsampling and upsampling parts with copy-and-crop connections, used for gaining global understanding and reconstructing image shapes.
An architecture that improved on U-Net by allowing distant image patches to interact directly via self-attention mechanisms, and injecting conditions using an adaptive layer norm framework.
An advanced Diffusion Transformer architecture that considers text conditions as part of a joint attention mechanism for improved image generation.
A sampling distribution for noise levels during model training that focuses more on middle steps, which are crucial for making important generation decisions.
A technique that allows for tuning only a subset of a large model's weights, making fine-tuning more efficient and memory-friendly.
A category of methods aimed at shortening the number of inference steps needed to generate samples from a model, thereby reducing generation time and computational cost.
An example of a distillation method for reducing inference steps in generative models.
A smart way of computing pairwise comparisons between generative models by taking into consideration the strength of the opponent in human preference ratings, providing a more robust performance metric than simple win rates.
A metric similar to FID, used to evaluate the quality of generated videos by measuring the distance between generated video distributions and real video distributions, using a pre-trained encoder specific for videos.
A 3D Variational Autoencoder designed for video generation where convolutions are asymmetric, ensuring that feature maps for a given frame only depend on itself and previous frames, allowing for streaming encoding and decoding.
An architecture initially designed for natural language processing tasks like translation, but adapted for vision tasks due to its scalability benefits, forming the basis of Diffusion Transformers and other generative models.
An adaptation of the GRPO optimization method from the LLM world to the vision domain, combining flow matching with optimization techniques.
An approach to generating text block by block using diffusion, which can be useful for handling variable output lengths in text generation, especially when combined with auto-regressive elements.
A model that combines different modalities (like text and images) in the same space using a contrastive loss.
A tuning technique for image generation models that enables them to generate images of a specific subject or person multiple times by training with a small set of their images and a rare token.
Models capable of processing both text and image inputs, which can be leveraged to automate image quality evaluation by acting as 'judges'.
An image generation model from OpenAI, ranked among the best performing models on a public leaderboard.
The company that developed Hydream 01, an open-weight image generation model described as the highest-ranked on a public leaderboard as of late May 2026.
An open-weight image generation model from Hydra, ranked as the highest-performing open-weight model on a public leaderboard as of late May 2026.
An image generation model discussed in Lecture 5, identified as a Multimodal Diffusion Transformer, using a flow matching loss, VAEs, and text embeddings based on Qwen.
An image generation model from Black Forest Labs, based on rectified flow, a combination of single and double stream Diffusion Transformers, VAEs, and Mistral 3 for text embeddings.
A pre-trained text encoder used in the Flux 2 models for generating text embeddings.
A text embedding model used in Qwen-Image for generating text representations.
A newly published, top-ranked open-weight image generation model that utilizes flow matching, a transformer-based architecture, but notably operates directly in pixel space without a VAE or pre-trained text encoder. It achieves impressive results through significant scaling.
A well-known image editing software, mentioned as an interface for vision language models (VLMs) to execute inferred editing actions.
An encoder-only architecture in the text world that uses a similar pre-training task involving masking tokens, but with a fixed masking scheme rather than a variable noise level.
A company founded by a Stanford professor that is actively working on diffusion-based models for text generation.
A watermarking technology from Google DeepMind that hides patterns within the pixels of AI-generated images to reveal their origin, overcoming the limitation of metadata loss in screenshots.
A type of software that assists with coding, mentioned as a tool for guiding through code flows from GitHub repositories to learn about AI methods.
A leading AI research lab, mentioned for having the top-ranked models, GPT Image and another unnamed model, on a public leaderboard.
A major technology company, mentioned for having two top-ranked AI models on a public leaderboard.
An AI company mentioned for having a top-ranked model on a public leaderboard.
The lab that developed the Flux 2 models, which are based on rectified flow and a combination of single and double stream diffusion transformers.
A platform for hosting code, recommended for cloning repositories of research papers to understand how methods work.
A social media platform highlighted for its growing community discussing AI topics and interesting recommendations.
A norm gaining traction among software companies that attaches a history or metadata to AI-generated images to reveal their origin, as a way to counter 'model collapse' and ensure trust.
A leading AI research lab, developers of SynthID watermarking technology.
The institution hosting the CME 296 course and other vision-related courses like 231N.