Key Moments

Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 2 - Score matching

Stanford Online | Apr 14, 2026 | 109 min video
TL;DR

Score matching allows AI to learn data distributions without knowing the normalizing constant, crucial for generative models. However, accurately learning the score in low-density regions remains a challenge.

Key Insights

1. The 'score' of a probability distribution, defined as the gradient of the log-probability, is a tractable and numerically stable alternative to the gradient of the probability itself for guiding generative models.

2. Denoising score matching simplifies score estimation by training a model to predict the score of noisy data, leveraging the tractability of the score for Gaussian distributions.

3. The Noise Conditional Score Network (NCSN) framework learns scores across multiple noise levels, using high noise to guide initial exploration and progressively lower noise for refinement.

4. The forward diffusion process in DDPM, which adds Gaussian noise to images, is mathematically linked to the score function: the score is proportional to the negative of the added noise.

5. Continuous formulations using Stochastic Differential Equations (SDEs) unify Denoising Diffusion Probabilistic Models (DDPMs) and Score-Based Generative Models (SGMs), allowing for more flexible and powerful sampling techniques.

6. Probability flow ordinary differential equations (PF-ODEs) offer a deterministic alternative to sampling from the reverse SDE, potentially improving sample quality and reducing computational steps; specialized solvers additionally exploit the linear nature of the drift term.

Understanding the 'score' for generative modeling

The lecture introduces score matching as a powerful paradigm for generative modeling, particularly for creating new data samples (like images) from a complex, unknown data distribution (P_data). The core challenge is to move from simple, easy-to-sample distributions (like Gaussian noise) towards regions of high probability density in the data distribution. While the gradient of the probability density (gradient of P(x)) points in the direction of steepest increase, it is typically intractable, because evaluating P(x) requires an unknown normalizing constant, and it vanishes in low-density regions, making it numerically unstable. The solution is the 'score' function, the gradient of the log-probability (gradient of log P(x)). Writing P(x) as an unnormalized density divided by a constant Z, the logarithm turns the division into a subtraction of log Z, whose gradient with respect to x is zero; the score is therefore tractable, points in the same direction as the gradient of P(x), and is more numerically stable. The lecture frames generative modeling as a process of 'following the score' to move towards higher density regions.
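As a concrete sanity check (a toy 1-D Gaussian, not code from the lecture), the score of N(mu, sigma^2) can be written down analytically and confirmed against a finite-difference gradient of the log-density; note that the normalizing constant contributes nothing to the gradient:

```python
import math

def log_gauss(x, mu=0.0, sigma=1.0):
    # log N(x; mu, sigma^2), including the normalizing constant.
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def score_analytic(x, mu=0.0, sigma=1.0):
    # grad_x log p(x) = -(x - mu) / sigma^2; the constant term drops out.
    return -(x - mu) / sigma**2

def score_numeric(x, mu=0.0, sigma=1.0, h=1e-5):
    # Central finite difference of log p confirms the analytic score.
    return (log_gauss(x + h, mu, sigma) - log_gauss(x - h, mu, sigma)) / (2 * h)

print(abs(score_analytic(1.7) - score_numeric(1.7)) < 1e-6)  # True
```

For a Gaussian the score points from x straight back toward the mean, which is exactly the "move toward high density" behavior the lecture describes.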

Bridging the gap: Denoising score matching

Directly estimating the score of P_data is still challenging. Denoising score matching offers a practical solution by leveraging a key insight: the score of a noisy version of the data distribution can be learned. By adding Gaussian noise to data points from P_data, we create perturbed distributions whose conditional scores are analytically computable (for a Gaussian, the score is simply -(x - mean) / variance). The denoising score matching objective trains a model to predict the score of this noisy distribution. A crucial theoretical result shows that this objective is equivalent, up to a constant independent of the model, to regressing onto the score of the Gaussian perturbation kernel conditioned on the original data point, which is known in closed form. This allows us to estimate the score without ever evaluating P_data or its score.
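A minimal numpy sketch of the denoising score matching objective in 1-D (the helper names `dsm_targets` and `dsm_loss`, and the toy N(0, 1) data, are illustrative assumptions, not the lecture's code):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5  # perturbation noise level

def dsm_targets(x0):
    # Perturb clean data: x = x0 + sigma * eps, with eps ~ N(0, 1).
    eps = rng.standard_normal(x0.shape)
    x = x0 + sigma * eps
    # Score of the perturbation kernel q_sigma(x | x0) = N(x; x0, sigma^2):
    # grad_x log q = -(x - x0) / sigma^2 = -eps / sigma.
    return x, -(x - x0) / sigma**2

def dsm_loss(score_model, x0):
    # L2 regression of the model's score onto the conditional score target.
    x, target = dsm_targets(x0)
    return float(np.mean((score_model(x) - target) ** 2))

# If the data is N(0, 1), the noisy marginal is N(0, 1 + sigma^2), whose true
# score is -x / (1 + sigma^2). Plugging in that optimal score leaves only the
# irreducible loss, 1 / (sigma**2 * (1 + sigma**2)) = 3.2 here.
x0 = rng.standard_normal(50_000)
loss = dsm_loss(lambda x: -x / (1.0 + sigma**2), x0)
```

The residual loss at the optimum is nonzero because the target -eps / sigma depends on the particular noise draw, not only on the noisy point x; the minimizer of the regression is the marginal score.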

Addressing limitations with Noise Conditional Score Networks (NCSN)

A significant challenge arises when using denoising score matching: the accuracy of score estimation is poor in low-density regions of the noisy distribution. This is because the loss function is weighted by the probability of sampling a noisy data point (Q_sigma), meaning low-density regions contribute little to the training. If the initial sampling starts in these sparse areas, the inaccurate score estimates can lead the generative process astray. The Noise Conditional Score Network (NCSN) framework addresses this by learning scores for multiple noise levels (sigmas). The idea is to use samples from highly noised distributions (large sigma) for initial rough guidance, as these distributions are smoother and better cover the entire space. As the process progresses, lower noise levels (smaller sigma) are used to refine the samples and approach the true data distribution, effectively bridging the gap between broad exploration and precise generation.
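The coarse-to-fine procedure above is annealed Langevin dynamics. Here is a toy sketch with data distribution N(3, 1), where the smoothed score is available in closed form; the noise levels, step-size rule, and `smoothed_score` helper are illustrative assumptions (NCSN uses a learned network and a step size proportional to sigma_i^2):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 3.0  # toy target: data distribution N(mu, 1)

def smoothed_score(x, sigma):
    # Score of the sigma-smoothed target N(mu, 1 + sigma^2). In NCSN this
    # closed form is replaced by a learned network s_theta(x, sigma).
    return -(x - mu) / (1.0 + sigma**2)

def annealed_langevin(n=2000, sigmas=(10.0, 3.0, 1.0, 0.3, 0.1), steps=100):
    x = sigmas[0] * rng.standard_normal(n)  # start broad: covers the space
    for sigma in sigmas:  # large sigma first, then progressively refine
        # Step size scaled to the smoothed variance so every level is stable
        # in this toy case (NCSN's schedule is alpha_i ~ sigma_i^2 instead).
        alpha = 0.1 * (1.0 + sigma**2)
        for _ in range(steps):
            z = rng.standard_normal(n)
            x = x + 0.5 * alpha * smoothed_score(x, sigma) + np.sqrt(alpha) * z
    return x

samples = annealed_langevin()
print(float(samples.mean()))  # lands near mu
```

The high-noise levels move samples from anywhere in space toward the data region, where the low-noise scores are accurate enough to finish the job.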

Connecting score matching and diffusion models

The lecture highlights the deep connection between score matching and diffusion models like DDPM. The forward process in DDPM, which gradually adds Gaussian noise to an image, can be expressed as a Stochastic Differential Equation (SDE). It's shown that the score of the noisy image distribution in DDPM is directly related to the noise added at that step. Specifically, the score is proportional to the negative of the noise. This unification means that DDPMs and score-based models can be viewed as different ways of parameterizing the same generative process. While DDPMs can be characterized as 'variance preserving' (total variance remains around 1) and NCSNs as 'variance exploding' (noise can grow), the continuous-time SDE formulation bridges these perspectives.
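The claimed link can be checked on a single scalar (toy numbers, not from the lecture): since q(x_t | x_0) is Gaussian with mean sqrt(alpha_bar) * x_0 and variance 1 - alpha_bar, its score reduces exactly to -eps / sqrt(1 - alpha_bar):

```python
import math

alpha_bar = 0.6  # cumulative product of (1 - beta_t) at some step t
x0, eps = 0.8, -1.3  # a clean value and its Gaussian noise draw (toy numbers)
xt = math.sqrt(alpha_bar) * x0 + math.sqrt(1 - alpha_bar) * eps

# q(x_t | x_0) = N(sqrt(alpha_bar) x_0, (1 - alpha_bar) I), so its score is
# -(x_t - sqrt(alpha_bar) x_0) / (1 - alpha_bar) = -eps / sqrt(1 - alpha_bar).
score = -(xt - math.sqrt(alpha_bar) * x0) / (1 - alpha_bar)
print(abs(score - (-eps / math.sqrt(1 - alpha_bar))) < 1e-12)  # True
```

This is why a DDPM network trained to predict eps is, up to a known scaling, a score model.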

Continuous time: Stochastic Differential Equations (SDEs)

Moving from discrete noise steps to continuous-time SDEs offers significant advantages. This continuous formulation allows for a more flexible and mathematically richer framework, leveraging existing tools from differential equations. The forward SDE describes how data is noised, typically involving a 'drift' term (deterministic) and a 'diffusion' term (stochastic). The key insight is that the reverse of this SDE, which is required for generation (denoising), can be formulated using this drift, diffusion, and crucially, the score function learned by the model. This provides a unified view and enables advanced sampling techniques.
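A forward SDE of this kind can be simulated with Euler–Maruyama. The sketch below uses the variance-preserving SDE dx = -0.5 * beta * x dt + sqrt(beta) dw with a constant beta (an assumption for simplicity; real schedules vary beta over time) and a point-mass "dataset" at x0 = 2:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 1.0  # constant noise schedule, for simplicity
dt, steps, n = 0.01, 1000, 10_000

# Euler–Maruyama: deterministic drift step plus a sqrt(dt)-scaled noise step.
x = np.full(n, 2.0)  # start every sample at the "data point" x0 = 2
for _ in range(steps):
    dw = np.sqrt(dt) * rng.standard_normal(n)
    x = x - 0.5 * beta * x * dt + np.sqrt(beta) * dw

# After t = 10 the marginal is close to the N(0, 1) prior.
print(round(float(x.mean()), 1), round(float(x.var()), 1))  # mean ~ 0, var ~ 1
```

The drift contracts the data toward zero while the diffusion injects exactly enough noise to hold the total variance near 1, which is the "variance preserving" behavior attributed to DDPM.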

Probability Flow ODEs for efficient sampling

While solving the reverse SDE can generate samples, its stochasticity demands many discretization steps and accumulates errors. The lecture introduces the probability flow ordinary differential equation (PF-ODE) as a deterministic counterpart. By reformulating the reverse SDE, an ODE is derived whose trajectories are deterministic yet whose marginal distributions match those of the SDE at every time, so the distribution of generated samples remains consistent with the target. This simplifies sampling, potentially yielding higher sample quality with fewer steps, much as DDIM (a deterministic variant of DDPM) improved upon DDPM sampling; the linear structure of the drift term can further be exploited by specialized solvers.
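A toy PF-ODE sampler (illustrative assumptions throughout: the VP SDE with constant beta = 1, and Gaussian data N(2, 0.5^2) so that the score of every noised marginal is available in closed form instead of from a trained network):

```python
import numpy as np

rng = np.random.default_rng(0)
T, dt, n = 8.0, 0.001, 10_000

def score(x, t):
    # For toy data N(2, 0.5^2) under dx = -0.5 x dt + dw, the noised marginal
    # is Gaussian with mean m_t = 2 exp(-t/2) and variance
    # v_t = 1 - 0.75 exp(-t), so the score is known exactly.
    m = 2.0 * np.exp(-0.5 * t)
    v = 1.0 - 0.75 * np.exp(-t)
    return -(x - m) / v

# Probability flow ODE: dx/dt = f(x, t) - 0.5 * g(t)^2 * score(x, t)
#                             = -0.5 * x - 0.5 * score(x, t)   (beta = 1)
x = rng.standard_normal(n)  # start from the N(0, 1) prior at t = T
t = T
while t > 0:
    dxdt = -0.5 * x - 0.5 * score(x, t)
    x = x - dt * dxdt  # Euler step backward in time: deterministic, no noise
    t -= dt

print(round(float(x.mean()), 1), round(float(x.std()), 1))  # ~ 2.0 and 0.5
```

No noise is injected during sampling, yet the endpoint statistics match the data distribution N(2, 0.5^2): the ODE transports the prior's probability mass rather than diffusing it.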

Advanced solvers and the DPM-Solver

To efficiently solve these continuous generative models, advanced numerical solvers are employed. The lecture mentions ODE solvers like Runge-Kutta methods. A more specialized and efficient method introduced is the DPM-Solver. It leverages the fact that certain components of the ODE (like the drift term) are linear in x, allowing for exact analytical solutions for those parts. Discretization is then focused only on the nonlinear terms, significantly reducing the number of function evaluations (and thus computational cost) needed to achieve high-quality samples. This approach demonstrates how deep theoretical understanding of SDEs and ODEs can lead to practical improvements in generative model performance, achieving good results with tens of function evaluations.
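The exact-linear-part idea can be illustrated on a scalar ODE dx/dt = a*x + sin(t) (a stand-in example, not the actual DPM-Solver): an exponential Euler step solves the linear drift analytically and freezes only the remaining term over each step, so it stays accurate with very few function evaluations where plain Euler falls apart:

```python
import math

a, x0, T = -4.0, 1.0, 5.0  # stiff linear drift plus a slowly varying term

def plain_euler(steps):
    # Baseline: discretize the entire right-hand side.
    h, x, t = T / steps, x0, 0.0
    for _ in range(steps):
        x = x + h * (a * x + math.sin(t))
        t += h
    return x

def exp_euler(steps):
    # Exponential Euler: x(t+h) = e^{ah} x(t) + ((e^{ah} - 1) / a) * sin(t),
    # i.e. the linear part is solved exactly and only sin(t) is frozen.
    h, x, t = T / steps, x0, 0.0
    for _ in range(steps):
        x = math.exp(a * h) * x + (math.exp(a * h) - 1.0) / a * math.sin(t)
        t += h
    return x

ref = exp_euler(200_000)  # fine-step reference solution
err_plain = abs(plain_euler(10) - ref)
err_exp = abs(exp_euler(10) - ref)
# With only 10 steps, treating the linear drift exactly is far more accurate.
```

This mirrors the DPM-Solver strategy described in the lecture: discretization error comes only from the nonlinear (network) term, so good samples survive aggressive step counts.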

Common Questions

How does score matching differ from DDPM?

Score matching is a generative paradigm focused on estimating the gradient of the log-probability density (the 'score') to guide new sample generation. Unlike DDPM, which predicts noise to remove, score matching directly guides samples towards high-density data regions.

Concepts mentioned in this video
Diffusion Probabilistic Models

A first-generation generative paradigm that progressively adds noise to clean images and learns a reverse process to remove it.

Denoising Score Matching

A common method for estimating the score function by leveraging the tractability of computing the score of a Gaussian distribution after perturbing images with noise.

Taylor's expansion

A mathematical series that approximates a function by a sum of terms calculated from the values of the function's derivatives at a single point, applied in the second-order DPM-Solver.

Logarithm function

Mathematical properties of the logarithm function are leveraged to make the gradient of the log-probability tractable by canceling out intractable normalizing constants.

Denoising Diffusion Implicit Models

An alternative to DDPM that allows for faster sampling by having fewer discrete steps, often mentioned in comparison to ODE-based methods.

CME 296

Stanford course on diffusion and large vision models.

Annealed Langevin Dynamics

A sampling technique that starts with a high amount of noise and progressively decreases it, leveraging scores from different noise levels to guide the sample to a clean image.

Stochastic Differential Equation

A differential equation in which one or more of the terms is a stochastic process, used to describe the continuous evolution of noise in generative models.

L2 Regression

A type of statistical loss function used in DDPM to minimize the squared difference between predicted and actual noise, and in score matching for score approximation.

Score Matching

A new generation paradigm for generative models that focuses on estimating the gradient of the log-probability density function to guide sampling towards high-density regions.

Continuity equation

An equation that expresses the conservation of a quantity, used to derive the Probability Flow ODE from the Fokker-Planck equation.

Wiener process

A type of stochastic process used to model continuous noise increments for transitioning discrete DDPM and NCSN formulations into a continuous framework.

Ordinary Differential Equation

A differential equation containing one or more functions of one independent variable and its derivatives, which provides a deterministic and more stable alternative to SDEs for generative modeling.

Markov chain Monte Carlo

A class of algorithms for sampling from a probability distribution by constructing a Markov chain that has the desired distribution as its stationary distribution.

Fokker–Planck equation

A partial differential equation that describes the time evolution of the probability density function of the position of a particle under the influence of fluctuating forces, used in SDE derivation.

Function Evaluations

A measure of computational complexity for samplers, representing the number of forward passes performed on a model, which researchers aim to minimize.

Probability Flow ODE

A deterministic ordinary differential equation derived from the continuity equation, which preserves probability flow and can be used for faster and more stable sampling in generative models.
