Key Moments

Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 1 - Diffusion

Stanford Online
5 min read · 107 min video
Apr 10, 2026 · 3,668 views
TL;DR

Diffusion models generate photorealistic images by iteratively removing noise, but they require thousands of sequential steps, making inference slow; techniques like DDIM can speed this up significantly.

Key Insights

1

Image generation models have progressed from the low-resolution, black-and-white images of 2014 to the high-resolution, color images produced by today's state-of-the-art models.

2

The DDPM (Denoising Diffusion Probabilistic Models) paper introduced a successful diffusion paradigm for images, pairing a forward process that adds noise with a learned reverse process that denoises.

3

In the diffusion process, starting from a Gaussian noise distribution simplifies mathematical formulations due to its favorable properties.

4

The DDPM training loss is derived from maximizing a lower bound (ELBO) of the data's probability, which simplifies to an L2 regression on the added noise.

5

DDIM (Denoising Diffusion Implicit Models) accelerates inference by allowing larger steps between noise levels, reducing the number of required sequential denoising operations from thousands to tens or hundreds.

6

While DDIM significantly speeds up image generation, there's a trade-off between speed and generation quality, with typical speedups ranging from 10x to 100x.

The evolution and goals of image generation models

The field of image generation has seen remarkable progress, moving from generating simple, low-resolution black and white images like digits and faces in 2014 to creating high-resolution, photorealistic color images today. This course aims to demystify these advanced image generation models, focusing on two primary goals: understanding the underlying paradigms like diffusion, score matching, and flow matching, and learning how these models are trained and evaluated. The lecture specifically delves into diffusion models, explaining their intuitive concept and mathematical underpinnings. Prerequisites include linear algebra, probability theory, differential equations, and basic machine learning knowledge.

The diffusion process: from noise to image

Diffusion models generate images by starting from pure noise, typically sampled from a Gaussian distribution, and iteratively refining it into a coherent image. The rationale for starting with noise is threefold: it's easy to sample, it introduces necessary randomness for generating diverse outputs, and Gaussian distributions possess desirable mathematical properties that simplify complex calculations. This process can be analogized to a sculptor starting with a rough block of stone and gradually chiseling away to reveal the final form. The core idea is to learn a 'denoising' process that reverses the noise addition.

Forward and reverse processes in DDPMs

The Denoising Diffusion Probabilistic Models (DDPM) paper, a landmark in the field, outlines a two-stage process. The forward process gradually adds Gaussian noise to a clean image (x_0) over a series of timesteps (T), transforming it into pure noise (x_T). This process is fixed in advance and designed to be tractable. Mathematically, each step x_t is a weighted combination of the previous image x_{t-1} and freshly added noise, controlled by a variance schedule (beta_t). The reverse process, which is what the model learns, starts from noise (x_T) and progressively denoises it back to a clean image (x_0). This sequential denoising is the core of image generation with diffusion models.
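The forward process described above can be sketched in a few lines of NumPy; the linear variance schedule here is a common choice, not something stated in the lecture summary:

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """One DDPM forward step: q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * eps

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))        # stand-in for a clean image x_0
betas = np.linspace(1e-4, 0.02, 1000)  # a common linear variance schedule, T = 1000
for beta in betas:
    x = forward_step(x, beta, rng)
# After all T steps, x is approximately standard Gaussian noise.
```

Note that each step preserves unit variance (the weights sqrt(1 - beta_t) and sqrt(beta_t) are chosen so the variances sum to one), which is why the process converges to a standard Gaussian rather than blowing up.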

Mathematical formulation of the forward process

The forward process, denoted by Q, allows direct sampling of a noisy image at any timestep t (x_t) from the initial clean image (x_0). This follows from the property that a sum of independent Gaussians is also Gaussian. Defining alpha_t = 1 - beta_t and alpha_bar_t as the cumulative product of the alpha values up to t, any noisy image x_t can be expressed as a function of x_0 and a standard Gaussian noise sample (epsilon): x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon. This direct sampling capability is crucial for efficient training.
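The closed-form sampling formula above translates directly to code; this is a minimal sketch, with the schedule values assumed rather than taken from the lecture:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)   # variance schedule beta_t (assumed)
alphas = 1.0 - betas                    # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)         # alpha_bar_t = product of alpha_s for s <= t

def q_sample(x0, t, rng):
    """Jump straight to timestep t: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    abar = alpha_bars[t]
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))        # stand-in for a clean image
xt, eps = q_sample(x0, t=500, rng=rng)  # one call, no sequential loop needed
```

This one-shot sampling is what makes training efficient: each training example needs a single draw of (t, epsilon), not a walk through all preceding timesteps.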

Deriving the training objective: The ELBO and loss function

The goal is to train a model (parameterized by theta) to maximize the probability of the training data, p_theta(x_0), which is intractable to optimize directly. Instead, the training objective uses the Evidence Lower Bound (ELBO), a lower bound on the log-likelihood. By introducing the known forward process Q and applying Jensen's inequality, the ELBO decomposes into KL divergences between distributions. Per timestep, it boils down to minimizing the KL divergence between two Gaussians: the tractable posterior of the forward process, Q(x_{t-1} | x_t, x_0), and the model's approximation, p_theta(x_{t-1} | x_t).
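The bound and its decomposition can be written out compactly. This follows the standard DDPM formulation rather than the lecture's exact notation:

```latex
\log p_\theta(x_0)
  \;\ge\; \mathbb{E}_{q(x_{1:T}\mid x_0)}
  \!\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T}\mid x_0)}\right]
  \qquad \text{(Jensen's inequality)}
```

which, after regrouping terms, gives the sum of per-timestep KL divergences:

```latex
-\mathrm{ELBO}
  = D_{\mathrm{KL}}\!\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)
  + \sum_{t=2}^{T} \mathbb{E}_q\, D_{\mathrm{KL}}\!\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)
  - \mathbb{E}_q \log p_\theta(x_0 \mid x_1)
```

The first term has no learnable parameters (the forward process is fixed), so training only touches the middle sum and the final reconstruction term.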

Simplifying the loss for practical training

The KL divergence between the two Gaussian distributions simplifies significantly. Because the forward process is known and tractable, and the model's reverse step is assumed Gaussian, the KL divergence can be computed analytically. The loss then reduces to an L2 regression: the model (epsilon_theta) is trained to predict the noise (epsilon) that was added to the clean image (x_0) to produce the noisy image (x_t), taking the noisy image and the timestep t as input. The loss is the mean squared error between the predicted noise and the actual added noise: `Loss = E_{t, x0, epsilon} [ || epsilon - epsilon_theta(x_t, t) ||^2 ]`.
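A single evaluation of this simplified objective can be sketched as follows; `eps_model` here is a hypothetical stand-in for the learned network epsilon_theta, which in practice would be a neural network (typically a U-Net):

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)  # assumed variance schedule
alpha_bars = np.cumprod(1.0 - betas)

def eps_model(xt, t):
    """Hypothetical placeholder for epsilon_theta(x_t, t); a real model is a neural net."""
    return np.zeros_like(xt)

def ddpm_loss(x0, rng):
    """Simplified DDPM objective: MSE between the true and predicted noise."""
    t = int(rng.integers(0, len(betas)))   # sample a random timestep
    eps = rng.standard_normal(x0.shape)    # the noise actually added
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(xt, t)) ** 2)

rng = np.random.default_rng(0)
loss = ddpm_loss(rng.standard_normal((8, 8)), rng)
```

Training then just repeats this with fresh draws of (x_0, t, epsilon) and backpropagates through the model, which is why the direct-sampling formula from the forward process matters so much in practice.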

Accelerating inference with DDIM

A major drawback of DDPMs is the slow inference time, requiring thousands of sequential denoising steps. Denoising Diffusion Implicit Models (DDIM) address this by developing a generation process that can skip steps. DDIM modifies the forward and reverse processes to allow for larger jumps between timesteps, effectively reducing the number of required steps. This is achieved by proposing a family of generative models where the reverse step can be made deterministic (setting variance to zero) or controlled by a hyperparameter sigma, allowing for significant speedups.
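The deterministic DDIM update (sigma = 0) can be sketched as below: it first estimates x_0 from the current noisy image and the predicted noise, then re-noises that estimate to the target (possibly much lower) noise level. The variable names are illustrative, not the lecture's:

```python
import numpy as np

def ddim_step(xt, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (sigma = 0), jumping from noise level
    abar_t down to abar_prev, which need not belong to the adjacent timestep."""
    # Predicted clean image, inverted from x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps.
    x0_pred = (xt - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # Re-noise the estimate to the target level using the same predicted noise.
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps_pred

# Sanity check: with the true noise and abar_prev = 1, the step recovers x0 exactly.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))
eps = rng.standard_normal(x0.shape)
abar_t = 0.5
xt = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
x_rec = ddim_step(xt, eps, abar_t, abar_prev=1.0)
```

Because no fresh noise is injected, the whole trajectory is a deterministic function of the initial sample x_T, which is what lets the sampler take large jumps between noise levels.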

The trade-off between speed and quality in DDIM

DDIM enables a speedup by using a smaller number of steps (s) compared to the original T steps of DDPM, with the speedup defined as T/s. While this significantly reduces inference time, it introduces a trade-off between speed and image quality. Experiments show that a speedup of up to 20x can be achieved with minimal degradation in metrics like FID (Fréchet Inception Distance). The core idea is to still match the marginal distributions of DDPM but allow for a deterministic or less stochastic generation path, enabling larger skips. This 'implicit' diffusion model relies on the initial noise sample for stochasticity, rather than sequential steps. Typical speedup factors range from 10x to 100x.
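The speedup arithmetic amounts to picking a subsequence of s timesteps out of the original T; an evenly spaced choice is one common option (the lecture summary does not specify the spacing):

```python
import numpy as np

T, s = 1000, 50                 # original DDPM steps vs. accelerated DDIM steps
speedup = T / s                 # 20x fewer sequential network evaluations
# Evenly spaced subsequence of timesteps visited by the accelerated sampler.
tau = np.linspace(0, T - 1, s).round().astype(int)
```

The sampler then applies one DDIM update per consecutive pair in `tau` instead of one per original timestep, trading a small amount of quality (measured e.g. by FID) for the T/s reduction in wall-clock time.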

Common Questions

What is the primary goal of CME 296? To understand the paradigms behind generating images and how the underlying image generation models are trained and evaluated, and to explain what makes these models work so effectively. (Timestamp: 117)

Topics

Mentioned in this video

Concepts
CME 295

A similar class offered for Large Language Models (LLMs), used as a reference for the exam format.

Ordinary Differential Equations

A prerequisite for the class: understanding why differential equations matter in this setting and how they are solved.

KL Divergence

A measure of how one probability distribution differs from a second, reference probability distribution, used to quantify the difference between the modeled and target distributions.

Transformer

The architecture toward which image generation models are currently converging, with transformer-based designs such as the Diffusion Transformer (DiT).

Large Language Models

Multi-modal LLMs that can understand and evaluate images are an emerging area of research for image evaluation.

Fréchet Inception Distance

A metric used to quantify the quality of generated images, which will be covered in lecture seven.

Flow matching

One of the main image generation paradigms covered in lectures one, two, and three.

Bayes' Rule

A fundamental concept in probability theory used to derive the tractable posterior distribution Q(x_{t-1} | x_t, x_0) in the reverse process.

probability theory

A prerequisite for the class, covering Bayes' rule, conditional and marginal probabilities, expectation, covariance matrix, and Gaussian distributions.

Jensen's inequality

A mathematical inequality used to derive the Evidence Lower Bound (ELBO) for intractable probability calculations in diffusion models.

Score Matching

One of the main image generation paradigms covered in lectures one, two, and three.

CME 296

The course on diffusion and large vision models, taught by Afshine and Shervine Amidi.

Gaussian Distribution

A normal distribution from which noise is sampled, known for its nice properties that simplify the math in diffusion models.

linear algebra

A prerequisite for the class, including understanding vectors, matrices, operations, gradients, and divergence of vector fields.

Stochastic Differential Equations

A prerequisite for the class, covering how stochastic differential equations are used and solved.

diffusion models

One of the main paradigms for image generation covered in lectures one, two, and three, focusing on how they work and their underlying mechanisms.

Machine Learning

Basic understanding of ML, neural networks, training, and inference is a prerequisite for the course.
