Key Moments

Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 2 - Score matching

Stanford Online | Apr 14, 2026 | 109 min video
TL;DR

Score matching allows AI to learn data distributions without knowing the normalizing constant, crucial for generative models. However, accurately learning the score in low-density regions remains a challenge.

Key Insights

1. The 'score' of a probability distribution, defined as the gradient of the log-probability, is a tractable and numerically stable alternative to the gradient of the probability itself for guiding generative models.

2. Denoising score matching simplifies score estimation by training a model to predict the score of noisy data, leveraging the tractability of the score for Gaussian distributions.

3. The Noise Conditional Score Network (NCSN) framework learns scores across multiple noise levels, using high noise to guide initial exploration and progressively lower noise for refinement.

4. The forward diffusion process in DDPM, which adds Gaussian noise to images, is mathematically linked to the score function: the score is proportional to the negative of the added noise.

5. Continuous formulations using Stochastic Differential Equations (SDEs) unify Denoising Diffusion Probabilistic Models (DDPMs) and Score-Based Generative Models (SGMs), allowing for more flexible and powerful sampling techniques.

6. Probability flow ordinary differential equations (PF-ODEs) offer a deterministic alternative to sampling from the reverse SDE, potentially improving sample quality and reducing computational steps; specialized solvers additionally exploit the linear nature of the drift term.

Understanding the 'score' for generative modeling

The lecture introduces score matching as a powerful paradigm for generative modeling, particularly for creating new data samples (like images) from a complex, unknown data distribution (P_data). The core challenge is to move from simple, easy-to-sample distributions (like Gaussian noise) towards regions of high probability density in the data distribution. While the gradient of the probability density (gradient of P(x)) points in the direction of steepest increase, it is typically intractable, because evaluating P(x) requires an unknown normalizing constant, and it vanishes in low-density regions, making it numerically unstable. The solution is the 'score' function, the gradient of the log-probability (gradient of log P(x)). Writing P(x) as an unnormalized density divided by a constant Z, the logarithm turns the division into a subtraction of log Z, whose gradient with respect to x is zero; the score is therefore tractable, points in the same direction as the gradient of P(x), and is more numerically stable. The lecture frames generative modeling as a process of 'following the score' to move towards higher density regions.
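As a concrete sanity check (a toy 1-D Gaussian, not code from the lecture), the score of N(mu, sigma^2) can be written down analytically and confirmed against a finite-difference gradient of the log-density; note that the normalizing constant contributes nothing to the gradient:

```python
import math

def log_gauss(x, mu=0.0, sigma=1.0):
    # log N(x; mu, sigma^2), including the normalizing constant.
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def score_analytic(x, mu=0.0, sigma=1.0):
    # grad_x log p(x) = -(x - mu) / sigma^2; the constant term drops out.
    return -(x - mu) / sigma**2

def score_numeric(x, mu=0.0, sigma=1.0, h=1e-5):
    # Central finite difference of log p confirms the analytic score.
    return (log_gauss(x + h, mu, sigma) - log_gauss(x - h, mu, sigma)) / (2 * h)

print(abs(score_analytic(1.7) - score_numeric(1.7)) < 1e-6)  # True
```

For a Gaussian the score points from x straight back toward the mean, which is exactly the "move toward high density" behavior the lecture describes.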

Bridging the gap: Denoising score matching

Directly estimating the score of P_data is still challenging. Denoising score matching offers a practical solution by leveraging a key insight: the score of a noisy version of the data distribution can be learned. By adding Gaussian noise to data points from P_data, we create perturbed distributions whose conditional scores are analytically computable (for a Gaussian, the score is simply -(x - mean) / variance). The denoising score matching objective trains a model to predict the score of this noisy distribution. A crucial theoretical result shows that this objective is equivalent, up to a constant independent of the model, to regressing onto the score of the Gaussian perturbation kernel conditioned on the original data point, which is known in closed form. This allows us to estimate the score without ever evaluating P_data or its score.
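A minimal numpy sketch of the denoising score matching objective in 1-D (the helper names `dsm_targets` and `dsm_loss`, and the toy N(0, 1) data, are illustrative assumptions, not the lecture's code):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5  # perturbation noise level

def dsm_targets(x0):
    # Perturb clean data: x = x0 + sigma * eps, with eps ~ N(0, 1).
    eps = rng.standard_normal(x0.shape)
    x = x0 + sigma * eps
    # Score of the perturbation kernel q_sigma(x | x0) = N(x; x0, sigma^2):
    # grad_x log q = -(x - x0) / sigma^2 = -eps / sigma.
    return x, -(x - x0) / sigma**2

def dsm_loss(score_model, x0):
    # L2 regression of the model's score onto the conditional score target.
    x, target = dsm_targets(x0)
    return float(np.mean((score_model(x) - target) ** 2))

# If the data is N(0, 1), the noisy marginal is N(0, 1 + sigma^2), whose true
# score is -x / (1 + sigma^2). Plugging in that optimal score leaves only the
# irreducible loss, 1 / (sigma**2 * (1 + sigma**2)) = 3.2 here.
x0 = rng.standard_normal(50_000)
loss = dsm_loss(lambda x: -x / (1.0 + sigma**2), x0)
```

The residual loss at the optimum is nonzero because the target -eps / sigma depends on the particular noise draw, not only on the noisy point x; the minimizer of the regression is the marginal score.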

Addressing limitations with Noise Conditional Score Networks (NCSN)

A significant challenge arises when using denoising score matching: the accuracy of score estimation is poor in low-density regions of the noisy distribution. This is because the loss function is weighted by the probability of sampling a noisy data point (Q_sigma), meaning low-density regions contribute little to the training. If the initial sampling starts in these sparse areas, the inaccurate score estimates can lead the generative process astray. The Noise Conditional Score Network (NCSN) framework addresses this by learning scores for multiple noise levels (sigmas). The idea is to use samples from highly noised distributions (large sigma) for initial rough guidance, as these distributions are smoother and better cover the entire space. As the process progresses, lower noise levels (smaller sigma) are used to refine the samples and approach the true data distribution, effectively bridging the gap between broad exploration and precise generation.
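The coarse-to-fine procedure above is annealed Langevin dynamics. Here is a toy sketch with data distribution N(3, 1), where the smoothed score is available in closed form; the noise levels, step-size rule, and `smoothed_score` helper are illustrative assumptions (NCSN uses a learned network and a step size proportional to sigma_i^2):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 3.0  # toy target: data distribution N(mu, 1)

def smoothed_score(x, sigma):
    # Score of the sigma-smoothed target N(mu, 1 + sigma^2). In NCSN this
    # closed form is replaced by a learned network s_theta(x, sigma).
    return -(x - mu) / (1.0 + sigma**2)

def annealed_langevin(n=2000, sigmas=(10.0, 3.0, 1.0, 0.3, 0.1), steps=100):
    x = sigmas[0] * rng.standard_normal(n)  # start broad: covers the space
    for sigma in sigmas:  # large sigma first, then progressively refine
        # Step size scaled to the smoothed variance so every level is stable
        # in this toy case (NCSN's schedule is alpha_i ~ sigma_i^2 instead).
        alpha = 0.1 * (1.0 + sigma**2)
        for _ in range(steps):
            z = rng.standard_normal(n)
            x = x + 0.5 * alpha * smoothed_score(x, sigma) + np.sqrt(alpha) * z
    return x

samples = annealed_langevin()
print(float(samples.mean()))  # lands near mu
```

The high-noise levels move samples from anywhere in space toward the data region, where the low-noise scores are accurate enough to finish the job.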

Connecting score matching and diffusion models

The lecture highlights the deep connection between score matching and diffusion models like DDPM. The forward process in DDPM, which gradually adds Gaussian noise to an image, can be expressed as a Stochastic Differential Equation (SDE). It's shown that the score of the noisy image distribution in DDPM is directly related to the noise added at that step. Specifically, the score is proportional to the negative of the noise. This unification means that DDPMs and score-based models can be viewed as different ways of parameterizing the same generative process. While DDPMs can be characterized as 'variance preserving' (total variance remains around 1) and NCSNs as 'variance exploding' (noise can grow), the continuous-time SDE formulation bridges these perspectives.
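The claimed link can be checked on a single scalar (toy numbers, not from the lecture): since q(x_t | x_0) is Gaussian with mean sqrt(alpha_bar) * x_0 and variance 1 - alpha_bar, its score reduces exactly to -eps / sqrt(1 - alpha_bar):

```python
import math

alpha_bar = 0.6  # cumulative product of (1 - beta_t) at some step t
x0, eps = 0.8, -1.3  # a clean value and its Gaussian noise draw (toy numbers)
xt = math.sqrt(alpha_bar) * x0 + math.sqrt(1 - alpha_bar) * eps

# q(x_t | x_0) = N(sqrt(alpha_bar) x_0, (1 - alpha_bar) I), so its score is
# -(x_t - sqrt(alpha_bar) x_0) / (1 - alpha_bar) = -eps / sqrt(1 - alpha_bar).
score = -(xt - math.sqrt(alpha_bar) * x0) / (1 - alpha_bar)
print(abs(score - (-eps / math.sqrt(1 - alpha_bar))) < 1e-12)  # True
```

This is why a DDPM network trained to predict eps is, up to a known scaling, a score model.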

Continuous time: Stochastic Differential Equations (SDEs)

Moving from discrete noise steps to continuous-time SDEs offers significant advantages. This continuous formulation allows for a more flexible and mathematically richer framework, leveraging existing tools from differential equations. The forward SDE describes how data is noised, typically involving a 'drift' term (deterministic) and a 'diffusion' term (stochastic). The key insight is that the reverse of this SDE, which is required for generation (denoising), can be formulated using this drift, diffusion, and crucially, the score function learned by the model. This provides a unified view and enables advanced sampling techniques.
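A forward SDE of this kind can be simulated with Euler–Maruyama. The sketch below uses the variance-preserving SDE dx = -0.5 * beta * x dt + sqrt(beta) dw with a constant beta (an assumption for simplicity; real schedules vary beta over time) and a point-mass "dataset" at x0 = 2:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 1.0  # constant noise schedule, for simplicity
dt, steps, n = 0.01, 1000, 10_000

# Euler–Maruyama: deterministic drift step plus a sqrt(dt)-scaled noise step.
x = np.full(n, 2.0)  # start every sample at the "data point" x0 = 2
for _ in range(steps):
    dw = np.sqrt(dt) * rng.standard_normal(n)
    x = x - 0.5 * beta * x * dt + np.sqrt(beta) * dw

# After t = 10 the marginal is close to the N(0, 1) prior.
print(round(float(x.mean()), 1), round(float(x.var()), 1))  # mean ~ 0, var ~ 1
```

The drift contracts the data toward zero while the diffusion injects exactly enough noise to hold the total variance near 1, which is the "variance preserving" behavior attributed to DDPM.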

Probability Flow ODEs for efficient sampling

While solving the reverse SDE can generate samples, its stochasticity demands many discretization steps and accumulates errors. The lecture introduces the probability flow ordinary differential equation (PF-ODE) as a deterministic counterpart. By reformulating the reverse SDE, an ODE is derived whose trajectories are deterministic yet whose marginal distributions match those of the SDE at every time, so the distribution of generated samples remains consistent with the target. This simplifies sampling, potentially yielding higher sample quality with fewer steps, much as DDIM (a deterministic variant of DDPM) improved upon DDPM sampling; the linear structure of the drift term can further be exploited by specialized solvers.
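A toy PF-ODE sampler (illustrative assumptions throughout: the VP SDE with constant beta = 1, and Gaussian data N(2, 0.5^2) so that the score of every noised marginal is available in closed form instead of from a trained network):

```python
import numpy as np

rng = np.random.default_rng(0)
T, dt, n = 8.0, 0.001, 10_000

def score(x, t):
    # For toy data N(2, 0.5^2) under dx = -0.5 x dt + dw, the noised marginal
    # is Gaussian with mean m_t = 2 exp(-t/2) and variance
    # v_t = 1 - 0.75 exp(-t), so the score is known exactly.
    m = 2.0 * np.exp(-0.5 * t)
    v = 1.0 - 0.75 * np.exp(-t)
    return -(x - m) / v

# Probability flow ODE: dx/dt = f(x, t) - 0.5 * g(t)^2 * score(x, t)
#                             = -0.5 * x - 0.5 * score(x, t)   (beta = 1)
x = rng.standard_normal(n)  # start from the N(0, 1) prior at t = T
t = T
while t > 0:
    dxdt = -0.5 * x - 0.5 * score(x, t)
    x = x - dt * dxdt  # Euler step backward in time: deterministic, no noise
    t -= dt

print(round(float(x.mean()), 1), round(float(x.std()), 1))  # ~ 2.0 and 0.5
```

No noise is injected during sampling, yet the endpoint statistics match the data distribution N(2, 0.5^2): the ODE transports the prior's probability mass rather than diffusing it.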

Advanced solvers and the DPM-Solver

To efficiently solve these continuous generative models, advanced numerical solvers are employed. The lecture mentions ODE solvers like Runge-Kutta methods. A more specialized and efficient method introduced is the DPM-Solver. It leverages the fact that certain components of the ODE (like the drift term) are linear in x, allowing for exact analytical solutions for those parts. Discretization is then focused only on the nonlinear terms, significantly reducing the number of function evaluations (and thus computational cost) needed to achieve high-quality samples. This approach demonstrates how deep theoretical understanding of SDEs and ODEs can lead to practical improvements in generative model performance, achieving good results with tens of function evaluations.
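The exact-linear-part idea can be illustrated on a scalar ODE dx/dt = a*x + sin(t) (a stand-in example, not the actual DPM-Solver): an exponential Euler step solves the linear drift analytically and freezes only the remaining term over each step, so it stays accurate with very few function evaluations where plain Euler falls apart:

```python
import math

a, x0, T = -4.0, 1.0, 5.0  # stiff linear drift plus a slowly varying term

def plain_euler(steps):
    # Baseline: discretize the entire right-hand side.
    h, x, t = T / steps, x0, 0.0
    for _ in range(steps):
        x = x + h * (a * x + math.sin(t))
        t += h
    return x

def exp_euler(steps):
    # Exponential Euler: x(t+h) = e^{ah} x(t) + ((e^{ah} - 1) / a) * sin(t),
    # i.e. the linear part is solved exactly and only sin(t) is frozen.
    h, x, t = T / steps, x0, 0.0
    for _ in range(steps):
        x = math.exp(a * h) * x + (math.exp(a * h) - 1.0) / a * math.sin(t)
        t += h
    return x

ref = exp_euler(200_000)  # fine-step reference solution
err_plain = abs(plain_euler(10) - ref)
err_exp = abs(exp_euler(10) - ref)
# With only 10 steps, treating the linear drift exactly is far more accurate.
```

This mirrors the DPM-Solver strategy described in the lecture: discretization error comes only from the nonlinear (network) term, so good samples survive aggressive step counts.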

Common Questions

How does score matching differ from DDPM?

Score matching is a generative paradigm focused on estimating the gradient of the log-probability density (the 'score') to guide new sample generation. Unlike DDPM, which predicts noise to remove, score matching directly guides samples towards high-density data regions.

Concepts mentioned in this video
Diffusion Probabilistic Models

A first-generation generative paradigm that progressively adds noise to clean images and learns a reverse process to remove it.

Denoising Score Matching

A common method for estimating the score function by leveraging the tractability of computing the score of a Gaussian distribution after perturbing images with noise.

Taylor's expansion

A mathematical series that approximates a function by a sum of terms calculated from the values of the function's derivatives at a single point, applied in the second-order DPM-Solver.

Logarithm function

Mathematical properties of the logarithm function are leveraged to make the gradient of the log-probability tractable by canceling out intractable normalizing constants.

Denoising Diffusion Implicit Models

An alternative to DDPM that allows for faster sampling by having fewer discrete steps, often mentioned in comparison to ODE-based methods.

CME 296

Stanford course on diffusion and large vision models.

Annealed Langevin Dynamics

A sampling technique that starts with a high amount of noise and progressively decreases it, leveraging scores from different noise levels to guide the sample to a clean image.

Stochastic Differential Equation

A differential equation in which one or more of the terms is a stochastic process, used to describe the continuous evolution of noise in generative models.

L2 Regression

A type of statistical loss function used in DDPM to minimize the squared difference between predicted and actual noise, and in score matching for score approximation.

Score Matching

A new generation paradigm for generative models that focuses on estimating the gradient of the log-probability density function to guide sampling towards high-density regions.

Continuity equation

An equation that expresses the conservation of a quantity, used to derive the Probability Flow ODE from the Fokker-Planck equation.

Wiener process

A type of stochastic process used to model continuous noise increments for transitioning discrete DDPM and NCSN formulations into a continuous framework.

Ordinary Differential Equation

A differential equation containing one or more functions of one independent variable and its derivatives, which provides a deterministic and more stable alternative to SDEs for generative modeling.

Markov chain Monte Carlo

A class of algorithms for sampling from a probability distribution by constructing a Markov chain that has the desired distribution as its stationary distribution.

Fokker–Planck equation

A partial differential equation that describes the time evolution of the probability density function of the position of a particle under the influence of fluctuating forces, used in SDE derivation.

Function Evaluations

A measure of computational complexity for samplers, representing the number of forward passes performed on a model, which researchers aim to minimize.

Probability Flow ODE

A deterministic ordinary differential equation derived from the continuity equation, which preserves probability flow and can be used for faster and more stable sampling in generative models.
