Key Moments

Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 1 - Diffusion

Stanford Online
5 min read · 107 min video
Apr 10, 2026 · 3,668 views
TL;DR

Diffusion models generate photorealistic images by iteratively removing noise, but they require thousands of sequential steps, making inference slow; techniques like DDIM can speed this up significantly.

Key Insights

1

Image generation models have progressed from the low-resolution, black-and-white images of 2014 to the high-resolution, color images produced by today's state-of-the-art models.

2

The DDPM (Denoising Diffusion Probabilistic Models) paper introduced a successful diffusion paradigm for images, pairing a forward process that adds noise with a learned reverse process that denoises.

3

In the diffusion process, starting from a Gaussian noise distribution simplifies mathematical formulations due to its favorable properties.

4

The DDPM training loss is derived from maximizing a lower bound (ELBO) of the data's probability, which simplifies to an L2 regression on the added noise.

5

DDIM (Denoising Diffusion Implicit Models) accelerates inference by allowing larger steps between noise levels, reducing the number of required sequential denoising operations from thousands to tens or hundreds.

6

While DDIM significantly speeds up image generation, there's a trade-off between speed and generation quality, with typical speedups ranging from 10x to 100x.

The evolution and goals of image generation models

The field of image generation has seen remarkable progress, moving from generating simple, low-resolution black and white images like digits and faces in 2014 to creating high-resolution, photorealistic color images today. This course aims to demystify these advanced image generation models, focusing on two primary goals: understanding the underlying paradigms like diffusion, score matching, and flow matching, and learning how these models are trained and evaluated. The lecture specifically delves into diffusion models, explaining their intuitive concept and mathematical underpinnings. Prerequisites include linear algebra, probability theory, differential equations, and basic machine learning knowledge.

The diffusion process: from noise to image

Diffusion models generate images by starting from pure noise, typically sampled from a Gaussian distribution, and iteratively refining it into a coherent image. The rationale for starting with noise is threefold: it's easy to sample, it introduces necessary randomness for generating diverse outputs, and Gaussian distributions possess desirable mathematical properties that simplify complex calculations. This process can be analogized to a sculptor starting with a rough block of stone and gradually chiseling away to reveal the final form. The core idea is to learn a 'denoising' process that reverses the noise addition.

Forward and reverse processes in DDPMs

The Denoising Diffusion Probabilistic Models (DDPM) paper, a landmark in the field, outlines a two-stage process. The forward process gradually adds Gaussian noise to a clean image (x_0) over a series of timesteps (T), transforming it into pure noise (x_T). This process is fixed in advance and designed to be tractable. Mathematically, each step x_t is a weighted combination of the previous image x_{t-1} and freshly added noise, controlled by a variance schedule (beta_t). The reverse process, which is what the model learns, starts from noise (x_T) and progressively denoises it back to a clean image (x_0). This sequential denoising is the core of image generation with diffusion models.
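The forward process described above can be sketched in a few lines of NumPy; the linear variance schedule here is a common choice, not something stated in the lecture summary:

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """One DDPM forward step: q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * eps

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))        # stand-in for a clean image x_0
betas = np.linspace(1e-4, 0.02, 1000)  # a common linear variance schedule, T = 1000
for beta in betas:
    x = forward_step(x, beta, rng)
# After all T steps, x is approximately standard Gaussian noise.
```

Note that each step preserves unit variance (the weights sqrt(1 - beta_t) and sqrt(beta_t) are chosen so the variances sum to one), which is why the process converges to a standard Gaussian rather than blowing up.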

Mathematical formulation of the forward process

The forward process, denoted by Q, allows direct sampling of a noisy image at any timestep t (x_t) from the initial clean image (x_0). This follows from the property that a sum of independent Gaussians is also Gaussian. Defining alpha_t = 1 - beta_t and alpha_bar_t as the cumulative product of the alpha values up to t, any noisy image x_t can be expressed as a function of x_0 and a standard Gaussian noise sample (epsilon): x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon. This direct sampling capability is crucial for efficient training.
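The closed-form sampling formula above translates directly to code; this is a minimal sketch, with the schedule values assumed rather than taken from the lecture:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)   # variance schedule beta_t (assumed)
alphas = 1.0 - betas                    # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)         # alpha_bar_t = product of alpha_s for s <= t

def q_sample(x0, t, rng):
    """Jump straight to timestep t: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    abar = alpha_bars[t]
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))        # stand-in for a clean image
xt, eps = q_sample(x0, t=500, rng=rng)  # one call, no sequential loop needed
```

This one-shot sampling is what makes training efficient: each training example needs a single draw of (t, epsilon), not a walk through all preceding timesteps.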

Deriving the training objective: The ELBO and loss function

The goal is to train a model (parameterized by theta) to maximize the probability of the training data, p_theta(x_0), which is intractable to optimize directly. Instead, the training objective uses the Evidence Lower Bound (ELBO), a lower bound on the log-likelihood. By introducing the known forward process Q and applying Jensen's inequality, the ELBO decomposes into KL divergences between distributions. Per timestep, it boils down to minimizing the KL divergence between two Gaussians: the tractable posterior of the forward process, Q(x_{t-1} | x_t, x_0), and the model's approximation, p_theta(x_{t-1} | x_t).
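The bound and its decomposition can be written out compactly. This follows the standard DDPM formulation rather than the lecture's exact notation:

```latex
\log p_\theta(x_0)
  \;\ge\; \mathbb{E}_{q(x_{1:T}\mid x_0)}
  \!\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T}\mid x_0)}\right]
  \qquad \text{(Jensen's inequality)}
```

which, after regrouping terms, gives the sum of per-timestep KL divergences:

```latex
-\mathrm{ELBO}
  = D_{\mathrm{KL}}\!\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)
  + \sum_{t=2}^{T} \mathbb{E}_q\, D_{\mathrm{KL}}\!\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)
  - \mathbb{E}_q \log p_\theta(x_0 \mid x_1)
```

The first term has no learnable parameters (the forward process is fixed), so training only touches the middle sum and the final reconstruction term.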

Simplifying the loss for practical training

The KL divergence between the two Gaussian distributions simplifies significantly. Because the forward process is known and tractable, and the model's reverse step is assumed Gaussian, the KL divergence can be computed analytically. The loss then reduces to an L2 regression: the model (epsilon_theta) is trained to predict the noise (epsilon) that was added to the clean image (x_0) to produce the noisy image (x_t), taking the noisy image and the timestep t as input. The loss is the mean squared error between the predicted noise and the actual added noise: `Loss = E_{t, x0, epsilon} [ || epsilon - epsilon_theta(x_t, t) ||^2 ]`.
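A single evaluation of this simplified objective can be sketched as follows; `eps_model` here is a hypothetical stand-in for the learned network epsilon_theta, which in practice would be a neural network (typically a U-Net):

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)  # assumed variance schedule
alpha_bars = np.cumprod(1.0 - betas)

def eps_model(xt, t):
    """Hypothetical placeholder for epsilon_theta(x_t, t); a real model is a neural net."""
    return np.zeros_like(xt)

def ddpm_loss(x0, rng):
    """Simplified DDPM objective: MSE between the true and predicted noise."""
    t = int(rng.integers(0, len(betas)))   # sample a random timestep
    eps = rng.standard_normal(x0.shape)    # the noise actually added
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(xt, t)) ** 2)

rng = np.random.default_rng(0)
loss = ddpm_loss(rng.standard_normal((8, 8)), rng)
```

Training then just repeats this with fresh draws of (x_0, t, epsilon) and backpropagates through the model, which is why the direct-sampling formula from the forward process matters so much in practice.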

Accelerating inference with DDIM

A major drawback of DDPMs is the slow inference time, requiring thousands of sequential denoising steps. Denoising Diffusion Implicit Models (DDIM) address this by developing a generation process that can skip steps. DDIM modifies the forward and reverse processes to allow for larger jumps between timesteps, effectively reducing the number of required steps. This is achieved by proposing a family of generative models where the reverse step can be made deterministic (setting variance to zero) or controlled by a hyperparameter sigma, allowing for significant speedups.
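The deterministic DDIM update (sigma = 0) can be sketched as below: it first estimates x_0 from the current noisy image and the predicted noise, then re-noises that estimate to the target (possibly much lower) noise level. The variable names are illustrative, not the lecture's:

```python
import numpy as np

def ddim_step(xt, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (sigma = 0), jumping from noise level
    abar_t down to abar_prev, which need not belong to the adjacent timestep."""
    # Predicted clean image, inverted from x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps.
    x0_pred = (xt - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # Re-noise the estimate to the target level using the same predicted noise.
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps_pred

# Sanity check: with the true noise and abar_prev = 1, the step recovers x0 exactly.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))
eps = rng.standard_normal(x0.shape)
abar_t = 0.5
xt = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
x_rec = ddim_step(xt, eps, abar_t, abar_prev=1.0)
```

Because no fresh noise is injected, the whole trajectory is a deterministic function of the initial sample x_T, which is what lets the sampler take large jumps between noise levels.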

The trade-off between speed and quality in DDIM

DDIM enables a speedup by using a smaller number of steps (s) compared to the original T steps of DDPM, with the speedup defined as T/s. While this significantly reduces inference time, it introduces a trade-off between speed and image quality. Experiments show that a speedup of up to 20x can be achieved with minimal degradation in metrics like FID (Fréchet Inception Distance). The core idea is to still match the marginal distributions of DDPM but allow for a deterministic or less stochastic generation path, enabling larger skips. This 'implicit' diffusion model relies on the initial noise sample for stochasticity, rather than sequential steps. Typical speedup factors range from 10x to 100x.
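The speedup arithmetic amounts to picking a subsequence of s timesteps out of the original T; an evenly spaced choice is one common option (the lecture summary does not specify the spacing):

```python
import numpy as np

T, s = 1000, 50                 # original DDPM steps vs. accelerated DDIM steps
speedup = T / s                 # 20x fewer sequential network evaluations
# Evenly spaced subsequence of timesteps visited by the accelerated sampler.
tau = np.linspace(0, T - 1, s).round().astype(int)
```

The sampler then applies one DDIM update per consecutive pair in `tau` instead of one per original timestep, trading a small amount of quality (measured e.g. by FID) for the T/s reduction in wall-clock time.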

Common Questions

What is the primary goal of CME 296? To understand the paradigms behind generating images and how the underlying image generation models are trained and evaluated, and to explain what makes these models work so effectively. (Timestamp: 117)

Topics

Mentioned in this video

Concepts
CME 295

A similar class offered for Large Language Models (LLMs), used as a reference for the exam format.

Ordinary Differential Equations

A prerequisite for the class: understanding why differential equations matter in this setting and how they are solved.

KL Divergence

A measure of how one probability distribution differs from a second, reference probability distribution, used to quantify the difference between the modeled and target distributions.

Transformer

The architecture toward which image generation models are currently converging, with transformer-based designs such as the Diffusion Transformer (DiT).

Large Language Models

Multi-modal LLMs that can understand and evaluate images are an emerging area of research for image evaluation.

Fréchet Inception Distance

A metric used to quantify the quality of generated images, which will be covered in lecture seven.

Flow matching

One of the main image generation paradigms covered in lectures one, two, and three.

Bayes' Rule

A fundamental concept in probability theory used to derive the tractable posterior distribution Q(x_{t-1} | x_t, x_0) in the reverse process.

probability theory

A prerequisite for the class, covering Bayes' rule, conditional and marginal probabilities, expectation, covariance matrix, and Gaussian distributions.

Jensen's inequality

A mathematical inequality used to derive the Evidence Lower Bound (ELBO) for intractable probability calculations in diffusion models.

Score Matching

One of the main image generation paradigms covered in lectures one, two, and three.

CME 296

The course on diffusion and large vision models, taught by Afshine and Shervine Amidi.

Gaussian Distribution

A normal distribution from which noise is sampled, known for its nice properties that simplify the math in diffusion models.

linear algebra

A prerequisite for the class, including understanding vectors, matrices, operations, gradients, and divergence of vector fields.

Stochastic Differential Equations

A prerequisite for the class, covering how stochastic differential equations are used and solved.

diffusion models

One of the main paradigms for image generation covered in lectures one, two, and three, focusing on how they work and their underlying mechanisms.

Machine Learning

Basic understanding of ML, neural networks, training, and inference is a prerequisite for the course.
