Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 4 - Latent Space & Guidance
Key Moments
Variational Autoencoders (VAEs) compress images into a structured, meaningful latent space, but often produce blurry outputs, a problem addressed by adding perceptual or adversarial losses.
Key Insights
Pixel space representations are high-dimensional (over 3 million dimensions for a 1024x1024 RGB image) and lack compactness and meaningful structure, making them unsuitable for generative models.
Autoencoders use an encoder to create a lower-dimensional latent representation and a decoder to reconstruct the image, aiming to minimize reconstruction loss but not necessarily structuring the latent space.
Variational Autoencoders (VAEs) enforce structure on the latent space by mapping inputs to a probability distribution (mean and variance) and regularizing it toward a prior distribution (typically a standard normal), with the regularization strength controlled by a coefficient.
The VAE loss function combines a pixel-wise L2 reconstruction loss with a KL divergence regularization term; because the L2 loss heavily penalizes even small pixel shifts, the model averages pixel values to minimize the penalty, producing blurry reconstructions.
Perceptual losses (such as LPIPS) and adversarial losses (borrowed from GANs) can be added to VAE training to combat blurriness, comparing feature maps instead of raw pixels or using a discriminator to push outputs toward realism.
Diffusion models can be guided toward user prompts (like text) in two ways: classifier guidance, which requires a separate classifier trained on noisy images, and classifier-free guidance, which avoids the auxiliary classifier by interpolating between conditioned and unconditioned noise predictions during inference.
The limitations of pixel space for image generation
The lecture begins by highlighting the challenges of working directly in pixel space for image generation tasks. While intuitive, pixel space is characterized by extremely high dimensionality: a 1024x1024 RGB image has over three million dimensions. This high dimensionality is not only computationally intensive but also inefficient, since pixel values carry significant redundancy. Furthermore, random perturbations in pixel space often result in images that are semantically meaningless. The ideal representation would be more tractable, compact, and meaningful, with clustered regions representing valid image variations, unlike the 'spiky' landscape of unstructured pixel space.
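To make the dimensionality gap concrete, the short sketch below computes the pixel-space dimensionality and contrasts it with a hypothetical latent space using an 8x spatial downsampling factor and 4 latent channels (illustrative values; the lecture does not fix a specific compression ratio).

```python
# Dimensionality of a 1024x1024 RGB image in pixel space.
height, width, channels = 1024, 1024, 3
pixel_dims = height * width * channels
print(pixel_dims)                      # 3,145,728 dimensions

# A hypothetical encoder with 8x spatial downsampling and 4 latent
# channels (illustrative values) shrinks the representation ~48x.
latent_dims = (height // 8) * (width // 8) * 4
print(latent_dims)                     # 65,536 dimensions
```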
Autoencoders: Compressing images into a latent space
To address the limitations of pixel space, autoencoders are introduced. An autoencoder consists of an encoder that compresses an input image into a lower-dimensional latent representation (the 'bottleneck') and a decoder that reconstructs the image from this latent representation. The primary goal is to minimize the reconstruction loss, ensuring the output closely matches the input. While this process effectively reduces dimensionality and creates a more compact representation, standard autoencoders do not inherently impose a structured or meaningful organization on the latent space. The model is solely incentivized to reconstruct, potentially leading to a latent space that is not smooth or conducive to generative processes like diffusion.
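A minimal sketch of this encoder-bottleneck-decoder structure, assuming PyTorch and a simple MLP over flattened images (the layer sizes are illustrative, not the lecture's architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Autoencoder(nn.Module):
    """Minimal autoencoder: compress to a latent bottleneck, then reconstruct."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),      # the bottleneck
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)                  # lower-dimensional latent code
        return self.decoder(z)               # reconstruction of the input

# Training minimizes only the reconstruction loss; nothing here constrains
# the latent space to be smooth or well organized.
model = Autoencoder()
x = torch.randn(16, 784)                     # a batch of flattened images
loss = F.mse_loss(model(x), x)
```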
Variational autoencoders for structured latent spaces
Variational Autoencoders (VAEs) build upon autoencoders by introducing a probabilistic approach to the latent space. Instead of mapping an input image directly to a single latent vector, the encoder in a VAE outputs parameters (mean and standard deviation) for a probability distribution (typically Gaussian) from which the latent representation is sampled. This distribution is then regularized to approximate a prior distribution, often a standard normal distribution. This regularization is crucial, as it incentivizes the latent space to be structured, compact, and meaningful, fulfilling the wish list established earlier. The trade-off between reconstruction accuracy and latent space structure is controlled by a coefficient associated with the KL divergence loss term.
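A sketch of the VAE-specific pieces, again assuming PyTorch: the encoder outputs a mean and log-variance, sampling uses the reparameterization trick, and the loss adds a closed-form KL term weighted by a coefficient beta (the names here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAEEncoder(nn.Module):
    """Maps an input to the mean and log-variance of a Gaussian latent."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.hidden = nn.Linear(input_dim, 256)
        self.mu_head = nn.Linear(256, latent_dim)
        self.logvar_head = nn.Linear(256, latent_dim)

    def forward(self, x):
        h = F.relu(self.hidden(x))
        return self.mu_head(h), self.logvar_head(h)

def sample_latent(mu, logvar):
    # Reparameterization trick: z = mu + sigma * eps keeps the sample
    # differentiable with respect to mu and logvar.
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def vae_loss(x, x_recon, mu, logvar, beta=1.0):
    # Pixel-wise L2 reconstruction loss.
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # Closed-form KL divergence between N(mu, sigma^2) and the standard
    # normal prior; beta trades reconstruction accuracy against structure.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```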
The blurriness problem and its solutions
Despite structuring the latent space, VAEs often suffer from a blurriness artifact in their reconstructions. This occurs because the pixel-wise L2 reconstruction loss heavily penalizes even small pixel shifts. To avoid this penalty, the model tends to average pixel values, resulting in smoother but less detailed images. This blurriness is more pronounced in VAEs than in standard autoencoders due to the inherent uncertainty introduced by sampling from the latent distribution. Two strategies are proposed to combat this: perceptual loss (e.g., LPIPS), which compares images based on feature maps extracted by pre-trained networks, making it more robust to pixel shifts and sensitive to semantic content; and adversarial loss, where a discriminator network tries to distinguish between generated and real images, pushing the decoder to produce more realistic outputs.
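A sketch of the perceptual-loss idea, assuming PyTorch with torchvision's pre-trained VGG16 as the feature extractor; LPIPS additionally learns per-channel weights on these feature differences, which are omitted here:

```python
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Frozen pre-trained network used purely as a feature extractor
# (inputs are assumed already normalized to ImageNet statistics).
features = vgg16(weights=VGG16_Weights.DEFAULT).features.eval()
for p in features.parameters():
    p.requires_grad_(False)

def perceptual_loss(x, y, layer_idx=16):
    # Compare intermediate feature maps instead of raw pixels, so small
    # spatial shifts are penalized far less than under pixel-wise L2.
    fx = features[:layer_idx](x)
    fy = features[:layer_idx](y)
    return F.mse_loss(fx, fy)
```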
Representing conditions: text and image embeddings
The lecture transitions to how conditions, such as text prompts or input images, can be incorporated into generation models. For text, transformer-based architectures are used, which employ tokenization and attention mechanisms to create rich embeddings. These embeddings capture the semantic meaning of words and sentences. For images, Vision Transformers (ViTs) adapt this concept by treating image patches as tokens, learning embeddings for them, and using self-attention to process these representations. The goal is to obtain semantic embeddings for both text and images. The CLIP model is highlighted as a method that jointly learns text and image embeddings by projecting them into a shared space and training with a contrastive loss, where similar text-image pairs are pulled together and dissimilar ones are pushed apart.
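A sketch of CLIP-style contrastive training, assuming PyTorch and that the image and text embeddings have already been projected into the shared space:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Embeddings for a batch of matched image-text pairs; normalize
    # to unit length so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (B, B) similarity matrix; matched pairs sit on the diagonal.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Pull matched pairs together and push mismatched ones apart,
    # symmetrically over images (rows) and texts (columns).
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```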
Guidance for conditional generation
Finally, the lecture discusses how to guide the generation process using these conditions. One approach applies Bayes' rule with an external classifier trained on noisy images (classifier guidance); classifier-free guidance is presented as a more efficient method that avoids training this separate classifier. It works by training the diffusion model to predict noise under both conditional and unconditional settings. During inference, the model interpolates between the unconditional noise prediction and the conditional noise prediction, controlled by a guidance scale factor w that determines how strongly the generation adheres to the condition. By randomly dropping the condition during training (e.g., 10-20% of the time), the model learns to capture both unconditional and conditional signals, enabling effective guided generation without an auxiliary classifier, though at the cost of two forward passes per denoising step.
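A sketch of the classifier-free guidance update at inference time, with `model`, `cond`, and `null_cond` as stand-ins for a diffusion network and its conditioning input (not a specific library API):

```python
def cfg_noise_prediction(model, x_t, t, cond, null_cond, w=7.5):
    # Two forward passes per denoising step: one with the condition
    # dropped (replaced by a null embedding) and one with it provided.
    eps_uncond = model(x_t, t, null_cond)
    eps_cond = model(x_t, t, cond)
    # w = 0 ignores the condition, w = 1 is purely conditional, and
    # w > 1 extrapolates to adhere more strongly to the condition.
    return eps_uncond + w * (eps_cond - eps_uncond)
```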
Common Questions
Why generate in a latent space instead of directly in pixel space?
Using a latent space (like in VAEs) makes image representation more tractable, compact, and meaningful, leading to smoother and more structured regions where valid images can be generated, and ultimately reduces the computational burden of diffusion models.
Mentioned in this video
Jensen's inequality: A mathematical inequality used to derive the lower bound of the VAE loss function, allowing for computation despite intractable integrals.
Score-based generation: A generation paradigm that derives the reverse noise process from the score function, offering a continuous version of noise reversal using stochastic differential equations.
Flow matching: A generation paradigm that views the process as transporting probability mass from an initial to a target distribution by predicting a vector field or velocity.
Perceptual loss: A loss function strategy that combats blurriness in VAE outputs by comparing feature maps of images rather than pixels, making it less sensitive to minor spatial shifts.
Latent diffusion: A diffusion model that operates in the latent space of a VAE, making it computationally cheaper and able to scale, while preserving image details.
Classifier-free guidance: A technique for guided generation that defines an implicit classifier from the conditioned and unconditioned generation networks, removing the need for a separate classifier.
Adversarial loss: A strategy using a discriminator network to distinguish between real and generated images, penalizing the generator for blurry or fake-looking outputs to improve realism.
Denoising diffusion: A generation paradigm where clean images are noisified discretely, and a model learns to reconstruct them by predicting the added noise using an L2 loss.
Autoencoder: A neural network architecture with an encoder and decoder, designed to learn a lower-dimensional (latent) representation of input data and reconstruct the original input, minimizing information loss.
Convolutional neural network (CNN): A type of neural network that uses convolutional layers and pooling operations for feature extraction and downsampling, particularly effective for image processing.
Self-attention: A core concept in transformer models, where a piece of input is represented as a function of all other pieces, enabling context-aware embeddings.
Maximum likelihood estimation: A statistical method used for deriving loss functions, where the goal is to find model parameters that maximize the probability of observing the given data.
LPIPS: A specific metric/loss for perceptual similarity, calculated as a weighted difference between feature maps of two images, tuned to match human perception.
Generative adversarial network (GAN): A model composed of a generator (like a VAE decoder) and a discriminator that compete to produce increasingly realistic images and to better distinguish real from fake.
Transformer: An encoder-decoder architecture centered on attention, introduced in 2017, foundational for most modern language and many vision models.
Variational autoencoder (VAE): A generative model whose encoder maps inputs to a probability distribution (mean and variance) in the latent space, forcing the latent space to have a structured, typically Gaussian, form.
KL divergence loss: A term in the VAE loss function that regularizes the encoder's latent distribution to be close to a fixed prior distribution (e.g., standard normal), helping to structure the latent space.
RGB: A common color representation system using red, green, and blue intensities for each pixel in an image.
Bottleneck: The lower-dimensional latent space in an autoencoder, where the model is forced to compress information while preserving essential details.
Vision Transformer (ViT): A model that applies the transformer architecture to images by learning embeddings on image patches instead of text tokens.