Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 2 - Score matching
Key Moments
Score matching lets a model learn a data distribution without knowing its normalizing constant, which is crucial for generative modeling. However, accurately learning the score in low-density regions remains a challenge.
Key Insights
The 'score' of a probability distribution, defined as the gradient of the log-probability, is a tractable and numerically stable alternative to the gradient of the probability itself for guiding generative models.
Denoising score matching simplifies score estimation by training a model to predict the score of noisy data, leveraging the tractability of the score for Gaussian distributions.
The Noise Conditional Score Network (NCSN) framework learns scores across multiple noise levels, using high noise to guide initial exploration and progressively lower noise for refinement.
The forward diffusion process in DDPM, which adds Gaussian noise to images, is mathematically linked to the score function; the score is proportional to the negative of the added noise.
Continuous formulations using Stochastic Differential Equations (SDEs) unify Denoising Diffusion Probabilistic Models (DDPMs) and Score-Based Generative Models (SGMs), allowing for more flexible and powerful sampling techniques.
Probability Flow Ordinary Differential Equations (PF-ODEs) offer a deterministic alternative to sampling from the reverse SDE, potentially improving sample quality and reducing computational steps by exploiting the linear nature of the drift term.
Understanding the 'score' for generative modeling
The lecture introduces score matching as a powerful paradigm for generative modeling, particularly for creating new data samples (like images) from a complex, unknown data distribution (P_data). The core challenge is to move from simple, easy-to-sample distributions (like Gaussian noise) toward regions of high probability density in the data distribution. While the gradient of the probability density function (gradient of P(x)) points in the direction of steepest increase, it is often intractable due to the normalizing constant and numerically unstable at low densities. The solution is the 'score' function, the gradient of the log-probability (gradient of log P(x)). This quantity is tractable because the normalizing constant drops out under the log-gradient, it points in the same direction as the density's gradient, and it is more numerically stable. The lecture frames generative modeling as a process of 'following the score' toward higher-density regions.
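This tractability claim can be checked directly on a one-dimensional Gaussian, where the score has the closed form -(x - mean)/variance. The sketch below (plain NumPy, with an illustrative test point) compares the analytic score against a finite-difference gradient of the unnormalized log-density — the normalizing constant is additive in log space, so it vanishes under the gradient:

```python
import numpy as np

def gaussian_score(x, mu=0.0, sigma=1.0):
    """Score of N(mu, sigma^2): d/dx log p(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma**2

def log_unnormalized(x, mu=0.0, sigma=1.0):
    """log p(x) up to the normalizing constant: -(x - mu)^2 / (2 sigma^2)."""
    return -(x - mu)**2 / (2 * sigma**2)

# Finite-difference gradient of the unnormalized log-density matches the
# analytic score: the constant log Z contributes nothing to the derivative.
x, h = 1.7, 1e-5
fd = (log_unnormalized(x + h) - log_unnormalized(x - h)) / (2 * h)
print(gaussian_score(x), fd)   # both approximately -1.7
```

This is the reason the score sidesteps the normalizing constant entirely: differentiating the log of the normalized and the unnormalized density gives the same result.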
Bridging the gap: Denoising score matching
Directly estimating the score of P_data is still challenging. Denoising score matching offers a practical solution by leveraging a key insight: the score of a noisy version of the data distribution can be learned. Specifically, by adding Gaussian noise to data points from P_data, we create distributions whose scores are analytically computable (e.g., for a Gaussian distribution, the score is simply -(x - mean) / variance). The denoising score matching objective trains a model to predict the score of this noisy distribution. A crucial theoretical result shows that optimizing this objective is equivalent to minimizing a tractable loss function involving the score of the noisy data conditioned on the original data points. This allows us to estimate the score without directly knowing P_data or its score.
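As a minimal illustration (NumPy, toy one-dimensional data; the noise level sigma is arbitrary), the denoising score matching regression target can be written down explicitly, because the conditional score of the Gaussian perturbation kernel is analytic:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5                                   # illustrative noise level

# Clean data points from (an unknown) p_data -- here just a toy sample.
x = rng.normal(loc=2.0, scale=1.0, size=1000)

# Perturb with Gaussian noise: x_tilde ~ N(x, sigma^2).
eps = rng.normal(size=x.shape)
x_tilde = x + sigma * eps

# DSM regression target: the conditional score of the Gaussian kernel,
# -(x_tilde - x) / sigma^2, which equals -eps / sigma.
target = -(x_tilde - x) / sigma**2

# A score model s_theta(x_tilde) would then be trained with
#   loss = mean((s_theta(x_tilde) - target)**2)
```

Note that the target depends only on the sampled noise, not on p_data itself, which is exactly what makes the objective tractable.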
Addressing limitations with Noise Conditional Score Networks (NCSN)
A significant challenge arises when using denoising score matching: the accuracy of score estimation is poor in low-density regions of the noisy distribution. This is because the loss function is weighted by the probability of sampling a noisy data point (Q_sigma), meaning low-density regions contribute little to the training. If the initial sampling starts in these sparse areas, the inaccurate score estimates can lead the generative process astray. The Noise Conditional Score Network (NCSN) framework addresses this by learning scores for multiple noise levels (sigmas). The idea is to use samples from highly noised distributions (large sigma) for initial rough guidance, as these distributions are smoother and better cover the entire space. As the process progresses, lower noise levels (smaller sigma) are used to refine the samples and approach the true data distribution, effectively bridging the gap between broad exploration and precise generation.
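The annealed procedure can be sketched with a toy one-dimensional Gaussian for p_data, so that the score of each noised distribution is known in closed form (in practice a trained network s_theta(x, sigma) replaces it); the noise schedule, step sizes, and iteration counts below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, s2 = 3.0, 0.25            # toy p_data = N(3, 0.25), so scores are analytic

def score(x, sigma):
    """Score of p_data convolved with N(0, sigma^2); a learned model in practice."""
    return -(x - mu) / (s2 + sigma**2)

sigmas = [5.0, 2.0, 1.0, 0.5, 0.1]       # decreasing noise levels
x = rng.normal(size=500) * sigmas[0]     # start from broad noise
for sigma in sigmas:
    step = 0.1 * sigma**2                # step size rescaled per noise level
    for _ in range(100):
        # Langevin update: follow the score plus injected noise
        x = x + step * score(x, sigma) + np.sqrt(2 * step) * rng.normal(size=x.shape)

print(x.mean())                          # close to the data mean, 3
```

The high-noise levels pull the broadly scattered samples toward the data region, and the low-noise levels refine them, mirroring the NCSN schedule described above.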
Connecting score matching and diffusion models
The lecture highlights the deep connection between score matching and diffusion models like DDPM. The forward process in DDPM, which gradually adds Gaussian noise to an image, can be expressed as a Stochastic Differential Equation (SDE). It's shown that the score of the noisy image distribution in DDPM is directly related to the noise added at that step. Specifically, the score is proportional to the negative of the noise. This unification means that DDPMs and score-based models can be viewed as different ways of parameterizing the same generative process. While DDPMs can be characterized as 'variance preserving' (total variance remains around 1) and NCSNs as 'variance exploding' (noise can grow), the continuous-time SDE formulation bridges these perspectives.
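This relationship is easy to verify numerically: for the DDPM perturbation kernel q(x_t | x_0) = N(sqrt(alpha_bar) x_0, (1 - alpha_bar) I), the conditional score works out to -eps / sqrt(1 - alpha_bar). A NumPy check (with alpha_bar chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_bar = 0.6                        # illustrative cumulative product at step t

x0 = rng.normal(size=1000)             # clean data
eps = rng.normal(size=x0.shape)        # the added Gaussian noise
xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps

# Conditional score of q(x_t | x_0) = N(sqrt(ab) x_0, (1 - ab) I):
score = -(xt - np.sqrt(alpha_bar) * x0) / (1 - alpha_bar)

# ... which equals -eps / sqrt(1 - alpha_bar): the score is proportional
# to the negative of the added noise, linking DDPM's noise prediction
# to score estimation.
```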
Continuous time: Stochastic Differential Equations (SDEs)
Moving from discrete noise steps to continuous-time SDEs offers significant advantages. This continuous formulation allows for a more flexible and mathematically richer framework, leveraging existing tools from differential equations. The forward SDE describes how data is noised, typically involving a 'drift' term (deterministic) and a 'diffusion' term (stochastic). The key insight is that the reverse of this SDE, which is required for generation (denoising), can be formulated using this drift, diffusion, and crucially, the score function learned by the model. This provides a unified view and enables advanced sampling techniques.
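A variance-preserving forward SDE of this kind can be simulated with the Euler–Maruyama scheme. The sketch below (constant beta and toy one-dimensional "data", both illustrative) shows samples from an arbitrary starting distribution drifting toward the standard normal prior:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 5.0                        # constant noise schedule, for simplicity
T, n_steps = 1.0, 1000
dt = T / n_steps

# Variance-preserving forward SDE: dx = -0.5*beta*x dt + sqrt(beta) dW.
x = rng.normal(size=5000) * 3.0 + 1.0        # "data" with mean 1, std 3
for _ in range(n_steps):
    # Euler-Maruyama step: deterministic drift plus scaled Gaussian increment
    x = x - 0.5 * beta * x * dt + np.sqrt(beta * dt) * rng.normal(size=x.shape)

print(x.mean(), x.var())          # drifts toward N(0, 1)
```

The drift contracts the mean toward zero while the diffusion injects just enough noise to keep the stationary variance at one — the "variance preserving" behavior attributed to DDPM above.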
Probability Flow ODEs for efficient sampling
While solving the reverse SDE can generate samples, its stochasticity can require many discretization steps and accumulate errors. The lecture introduces Probability Flow Ordinary Differential Equations (PF-ODEs) as a deterministic counterpart. By reformulating the reverse SDE, a new ODE is derived that preserves the probability flow (the overall distribution of generated samples remains consistent with the target distribution) but follows deterministic trajectories. This simplifies sampling, potentially yielding higher sample quality with fewer steps because the linear part of the system can be exploited, much as DDIM (a deterministic variant of DDPM) improved on DDPM sampling.
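For a Gaussian toy p_data, the marginal score under the variance-preserving forward process is available in closed form, so the probability flow ODE can be integrated backward with plain Euler steps (constants below are illustrative; in practice the score comes from the trained network):

```python
import numpy as np

rng = np.random.default_rng(0)
beta, T, n = 5.0, 1.0, 2000
dt = T / n
mu0, v0 = 2.0, 0.25    # toy p_data = N(2, 0.25), so the marginal score is analytic

def marginal_score(x, t):
    """Score of the VP-noised marginal at time t (a learned network in practice)."""
    m = mu0 * np.exp(-0.5 * beta * t)
    v = v0 * np.exp(-beta * t) + 1.0 - np.exp(-beta * t)
    return -(x - m) / v

# Probability flow ODE, dx/dt = -0.5*beta*x - 0.5*beta*score(x, t),
# integrated backward in time from the prior with deterministic Euler steps.
x = rng.normal(size=2000)             # start from the N(0, 1) prior at t = T
for i in range(n, 0, -1):
    t = i * dt
    dxdt = -0.5 * beta * x - 0.5 * beta * marginal_score(x, t)
    x = x - dt * dxdt                 # step backward from t to t - dt

print(x.mean(), x.std())              # roughly 2 and 0.5: mass lands on p_data
```

Unlike a reverse-SDE sampler, no noise is injected during generation; each trajectory is deterministic, yet the resulting sample distribution still matches the data distribution.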
Advanced solvers and the DPM-Solver
To efficiently solve these continuous generative models, advanced numerical solvers are employed. The lecture mentions ODE solvers like Runge-Kutta methods. A more specialized and efficient method introduced is the DPM-Solver. It leverages the fact that certain components of the ODE (like the drift term) are linear in x, allowing for exact analytical solutions for those parts. Discretization is then focused only on the nonlinear terms, significantly reducing the number of function evaluations (and thus computational cost) needed to achieve high-quality samples. This approach demonstrates how deep theoretical understanding of SDEs and ODEs can lead to practical improvements in generative model performance, achieving good results with tens of function evaluations.
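DPM-Solver itself works in log-SNR time with the noise-prediction network, but the core exponential-integrator idea can be shown on a toy semilinear ODE dx/dt = a*x + g(t): the linear a*x part is solved exactly via exp(a*dt), and only the remaining term is discretized. All constants below are illustrative, and g is a stand-in for the network-dependent term:

```python
import numpy as np

a = -4.0            # coefficient of the linear (drift-like) part
g = np.cos          # stand-in for the "nonlinear" term a score model would supply
x0, T = 1.0, 2.0

def euler(n):
    """Plain Euler: discretizes the whole right-hand side."""
    x, dt = x0, T / n
    for k in range(n):
        t = k * dt
        x = x + dt * (a * x + g(t))
    return x

def exp_euler(n):
    """Exponential integrator: solves the a*x part exactly with exp(a*dt),
    holding only g constant over each step -- the trick DPM-Solver exploits."""
    x, dt = x0, T / n
    for k in range(n):
        t = k * dt
        x = np.exp(a * dt) * x + (np.exp(a * dt) - 1.0) / a * g(t)
    return x

# Exact solution by variation of constants, for comparison.
exact = np.exp(a * T) * x0 + (-a * np.cos(T) + np.sin(T) + a * np.exp(a * T)) / (a**2 + 1)

print("euler error:    ", abs(euler(40) - exact))
print("exp-euler error:", abs(exp_euler(40) - exact))
```

Because the stiff linear part is handled analytically, the exponential integrator tolerates large steps that plain Euler handles poorly — the same reason DPM-Solver reaches good samples in tens of function evaluations.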
Common Questions
How does score matching differ from DDPM?
Score matching is a generative paradigm that estimates the gradient of the log-probability density (the 'score') to guide new sample generation. Where DDPM predicts the noise to remove, score matching directly guides samples toward high-density data regions.
Topics
Mentioned in this video
DDPM (Denoising Diffusion Probabilistic Models): A first-generation generative paradigm that progressively adds noise to clean images and learns a reverse process to remove it.
Denoising score matching: A common method for estimating the score function by leveraging the tractability of the score of a Gaussian distribution after perturbing images with noise.
Taylor series: A mathematical series that approximates a function by a sum of terms computed from the function's derivatives at a single point, applied in DPM-Solver-2.
Properties of logarithms: Mathematical properties of the logarithm function leveraged to make the gradient of the log-probability tractable by canceling out intractable normalizing constants.
DDIM (Denoising Diffusion Implicit Models): An alternative to DDPM that allows faster sampling with fewer discrete steps, often mentioned in comparison to ODE-based methods.
CME296: Stanford course on diffusion and large vision models.
Annealed Langevin dynamics: A sampling technique that starts with a high noise level and progressively decreases it, leveraging scores from different noise levels to guide the sample to a clean image.
Stochastic Differential Equation (SDE): A differential equation in which one or more terms is a stochastic process, used to describe the continuous evolution of noise in generative models.
Mean squared error (MSE) loss: A loss function used in DDPM to minimize the squared difference between predicted and actual noise, and in score matching for score approximation.
Score matching: A newer generative paradigm that estimates the gradient of the log-probability density function to guide sampling toward high-density regions.
Continuity equation: An equation expressing the conservation of a quantity, used to derive the Probability Flow ODE from the Fokker-Planck equation.
Wiener process: A stochastic process used to model continuous noise increments when transitioning the discrete DDPM and NCSN formulations into a continuous framework.
Ordinary Differential Equation (ODE): A differential equation involving functions of one independent variable and their derivatives, providing a deterministic and more stable alternative to SDEs for generative modeling.
Markov Chain Monte Carlo (MCMC): A class of algorithms for sampling from a probability distribution by constructing a Markov chain whose stationary distribution is the desired one.
Fokker-Planck equation: A partial differential equation describing the time evolution of the probability density of a particle under fluctuating forces, used in the SDE derivation.
Number of function evaluations (NFE): A measure of a sampler's computational cost — the number of forward passes through the model — which researchers aim to minimize.
Probability Flow ODE: A deterministic ordinary differential equation derived from the continuity equation that preserves probability flow and enables faster, more stable sampling in generative models.
Noise Conditional Score Network (NCSN): A framework that learns scores of the data distribution at several noise levels to guide sampling, starting with high noise and progressively decreasing it.
Runge-Kutta methods: Techniques for solving ODEs more precisely than Euler's method by evaluating derivatives at midpoints and endpoints of each step.
Euler-Maruyama method: A numerical method for solving stochastic differential equations, used for inference in SDE-based generative models to go from a noisy input to a final image.
DPM-Solver: A method for solving ODEs in generative models that exploits the linearity of part of the equation, enabling efficient and accurate sampling with fewer function evaluations.
Euler's method: A simple numerical method for approximating solutions to ordinary differential equations, cheap to compute but imprecise.
Langevin dynamics: An MCMC sampling method that uses the score function plus a stochastic term to explore higher-density regions of a probability distribution while ensuring diversity in generated samples.