Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 4 - Latent Space & Guidance
Key Moments
Variational Autoencoders (VAEs) compress images into a structured, meaningful latent space, but often produce blurry outputs, a problem addressed by adding perceptual or adversarial losses.
Key Insights
Pixel space representations are high-dimensional (over 3 million dimensions for a 1024x1024 RGB image) and lack compactness and meaningful structure, making them unsuitable for generative models.
Autoencoders use an encoder to create a lower-dimensional latent representation and a decoder to reconstruct the image, aiming to minimize reconstruction loss but not necessarily structuring the latent space.
Variational Autoencoders (VAEs) enforce structure on the latent space by mapping inputs to a probability distribution (mean and variance) and regularizing it toward a prior distribution (typically a standard normal), with the regularization strength controlled by a coefficient.
The VAE loss function combines a pixel-wise L2 reconstruction loss with a KL divergence regularization term; because the L2 loss heavily penalizes even small pixel shifts, the model averages pixel values to minimize the penalty, producing blurry reconstructions.
Perceptual losses (such as LPIPS) and adversarial losses (borrowed from GANs) can be added to VAE training to combat blurriness, comparing feature maps instead of raw pixels or using a discriminator to push outputs toward realism.
Diffusion models can be guided toward user prompts (like text) in two ways: classifier guidance, which requires a separate classifier trained on noisy images, and classifier-free guidance, which avoids the auxiliary classifier by interpolating between conditioned and unconditioned noise predictions during inference.
The limitations of pixel space for image generation
The lecture begins by highlighting the challenges of working directly in pixel space for image generation tasks. While intuitive, pixel space is characterized by extremely high dimensionality: a 1024x1024 RGB image has over three million dimensions. This high dimensionality is not only computationally intensive but also inefficient, since pixel values carry significant redundancy. Furthermore, random perturbations in pixel space often result in images that are semantically meaningless. The ideal representation would be more tractable, compact, and meaningful, with clustered regions representing valid image variations, unlike the 'spiky' landscape of unstructured pixel space.
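To make the dimensionality gap concrete, the short sketch below computes the pixel-space dimensionality and contrasts it with a hypothetical latent space using an 8x spatial downsampling factor and 4 latent channels (illustrative values; the lecture does not fix a specific compression ratio).

```python
# Dimensionality of a 1024x1024 RGB image in pixel space.
height, width, channels = 1024, 1024, 3
pixel_dims = height * width * channels
print(pixel_dims)                      # 3,145,728 dimensions

# A hypothetical encoder with 8x spatial downsampling and 4 latent
# channels (illustrative values) shrinks the representation ~48x.
latent_dims = (height // 8) * (width // 8) * 4
print(latent_dims)                     # 65,536 dimensions
```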
Autoencoders: Compressing images into a latent space
To address the limitations of pixel space, autoencoders are introduced. An autoencoder consists of an encoder that compresses an input image into a lower-dimensional latent representation (the 'bottleneck') and a decoder that reconstructs the image from this latent representation. The primary goal is to minimize the reconstruction loss, ensuring the output closely matches the input. While this process effectively reduces dimensionality and creates a more compact representation, standard autoencoders do not inherently impose a structured or meaningful organization on the latent space. The model is solely incentivized to reconstruct, potentially leading to a latent space that is not smooth or conducive to generative processes like diffusion.
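A minimal sketch of this encoder-bottleneck-decoder structure, assuming PyTorch and a simple MLP over flattened images (the layer sizes are illustrative, not the lecture's architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Autoencoder(nn.Module):
    """Minimal autoencoder: compress to a latent bottleneck, then reconstruct."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),      # the bottleneck
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)                  # lower-dimensional latent code
        return self.decoder(z)               # reconstruction of the input

# Training minimizes only the reconstruction loss; nothing here constrains
# the latent space to be smooth or well organized.
model = Autoencoder()
x = torch.randn(16, 784)                     # a batch of flattened images
loss = F.mse_loss(model(x), x)
```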
Variational autoencoders for structured latent spaces
Variational Autoencoders (VAEs) build upon autoencoders by introducing a probabilistic approach to the latent space. Instead of mapping an input image directly to a single latent vector, the encoder in a VAE outputs parameters (mean and standard deviation) for a probability distribution (typically Gaussian) from which the latent representation is sampled. This distribution is then regularized to approximate a prior distribution, often a standard normal distribution. This regularization is crucial, as it incentivizes the latent space to be structured, compact, and meaningful, fulfilling the wish list established earlier. The trade-off between reconstruction accuracy and latent space structure is controlled by a coefficient associated with the KL divergence loss term.
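A sketch of the VAE-specific pieces, again assuming PyTorch: the encoder outputs a mean and log-variance, sampling uses the reparameterization trick, and the loss adds a closed-form KL term weighted by a coefficient beta (the names here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAEEncoder(nn.Module):
    """Maps an input to the mean and log-variance of a Gaussian latent."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.hidden = nn.Linear(input_dim, 256)
        self.mu_head = nn.Linear(256, latent_dim)
        self.logvar_head = nn.Linear(256, latent_dim)

    def forward(self, x):
        h = F.relu(self.hidden(x))
        return self.mu_head(h), self.logvar_head(h)

def sample_latent(mu, logvar):
    # Reparameterization trick: z = mu + sigma * eps keeps the sample
    # differentiable with respect to mu and logvar.
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def vae_loss(x, x_recon, mu, logvar, beta=1.0):
    # Pixel-wise L2 reconstruction loss.
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # Closed-form KL divergence between N(mu, sigma^2) and the standard
    # normal prior; beta trades reconstruction accuracy against structure.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```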
The blurriness problem and its solutions
Despite structuring the latent space, VAEs often suffer from a blurriness artifact in their reconstructions. This occurs because the pixel-wise L2 reconstruction loss heavily penalizes even small pixel shifts. To avoid this penalty, the model tends to average pixel values, resulting in smoother but less detailed images. This blurriness is more pronounced in VAEs than in standard autoencoders due to the inherent uncertainty introduced by sampling from the latent distribution. Two strategies are proposed to combat this: perceptual loss (e.g., LPIPS), which compares images based on feature maps extracted by pre-trained networks, making it more robust to pixel shifts and sensitive to semantic content; and adversarial loss, where a discriminator network tries to distinguish between generated and real images, pushing the decoder to produce more realistic outputs.
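A sketch of the perceptual-loss idea, assuming PyTorch with torchvision's pre-trained VGG16 as the feature extractor; LPIPS additionally learns per-channel weights on these feature differences, which are omitted here:

```python
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Frozen pre-trained network used purely as a feature extractor
# (inputs are assumed already normalized to ImageNet statistics).
features = vgg16(weights=VGG16_Weights.DEFAULT).features.eval()
for p in features.parameters():
    p.requires_grad_(False)

def perceptual_loss(x, y, layer_idx=16):
    # Compare intermediate feature maps instead of raw pixels, so small
    # spatial shifts are penalized far less than under pixel-wise L2.
    fx = features[:layer_idx](x)
    fy = features[:layer_idx](y)
    return F.mse_loss(fx, fy)
```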
Representing conditions: text and image embeddings
The lecture transitions to how conditions, such as text prompts or input images, can be incorporated into generation models. For text, transformer-based architectures are used, which employ tokenization and attention mechanisms to create rich embeddings. These embeddings capture the semantic meaning of words and sentences. For images, Vision Transformers (ViTs) adapt this concept by treating image patches as tokens, learning embeddings for them, and using self-attention to process these representations. The goal is to obtain semantic embeddings for both text and images. The CLIP model is highlighted as a method that jointly learns text and image embeddings by projecting them into a shared space and training with a contrastive loss, where similar text-image pairs are pulled together and dissimilar ones are pushed apart.
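A sketch of CLIP-style contrastive training, assuming PyTorch and that the image and text embeddings have already been projected into the shared space:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Embeddings for a batch of matched image-text pairs; normalize
    # to unit length so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (B, B) similarity matrix; matched pairs sit on the diagonal.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Pull matched pairs together and push mismatched ones apart,
    # symmetrically over images (rows) and texts (columns).
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```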
Guidance for conditional generation
Finally, the lecture discusses how to guide the generation process using these conditions. One approach applies Bayes' rule with an external classifier trained on noisy images (classifier guidance); classifier-free guidance is presented as a more efficient method that avoids training this separate classifier. It works by training the diffusion model to predict noise under both conditional and unconditional settings. During inference, the model interpolates between the unconditional noise prediction and the conditional noise prediction, controlled by a guidance scale factor w that determines how strongly the generation adheres to the condition. By randomly dropping the condition during training (e.g., 10-20% of the time), the model learns to capture both unconditional and conditional signals, enabling effective guided generation without an auxiliary classifier, though at the cost of two forward passes per denoising step.
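A sketch of the classifier-free guidance update at inference time, with `model`, `cond`, and `null_cond` as stand-ins for a diffusion network and its conditioning input (not a specific library API):

```python
def cfg_noise_prediction(model, x_t, t, cond, null_cond, w=7.5):
    # Two forward passes per denoising step: one with the condition
    # dropped (replaced by a null embedding) and one with it provided.
    eps_uncond = model(x_t, t, null_cond)
    eps_cond = model(x_t, t, cond)
    # w = 0 ignores the condition, w = 1 is purely conditional, and
    # w > 1 extrapolates to adhere more strongly to the condition.
    return eps_uncond + w * (eps_cond - eps_uncond)
```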
Common Questions
Why generate in a latent space instead of directly in pixel space?
Using a latent space (like in VAEs) makes image representation more tractable, compact, and meaningful, leading to smoother and more structured regions where valid images can be generated, and ultimately reduces the computational burden of diffusion models.
Mentioned in this video
Jensen's inequality: A mathematical inequality used to derive the lower bound of the VAE loss function, allowing for computation despite intractable integrals.
Score-based generation: A generation paradigm that derives the reverse noise process from the score function, offering a continuous version of noise reversal using stochastic differential equations.
Flow matching: A generation paradigm that views the process as transporting probability mass from an initial to a target distribution by predicting a vector field or velocity.
Perceptual loss: A loss function strategy that combats blurriness in VAE outputs by comparing feature maps of images rather than pixels, making it less sensitive to minor spatial shifts.
Latent diffusion: A diffusion model that operates in the latent space of a VAE, making it computationally cheaper and able to scale, while preserving image details.
Classifier-free guidance: A technique for guided generation that defines an implicit classifier from the conditioned and unconditioned generation networks, removing the need for a separate classifier.
Adversarial loss: A strategy using a discriminator network to distinguish between real and generated images, penalizing the generator for blurry or fake-looking outputs to improve realism.
Denoising diffusion: A generation paradigm where clean images are noisified discretely, and a model learns to reconstruct them by predicting the added noise using an L2 loss.
Autoencoder: A neural network architecture with an encoder and decoder, designed to learn a lower-dimensional (latent) representation of input data and reconstruct the original input, minimizing information loss.
Convolutional neural network (CNN): A type of neural network that uses convolutional layers and pooling operations for feature extraction and downsampling, particularly effective for image processing.
Self-attention: A core concept in transformer models, where a piece of input is represented as a function of all other pieces, enabling context-aware embeddings.
Maximum likelihood estimation: A statistical method used for deriving loss functions, where the goal is to find model parameters that maximize the probability of observing the given data.
LPIPS: A specific metric/loss for perceptual similarity, calculated as a weighted difference between feature maps of two images, tuned to match human perception.
Generative adversarial network (GAN): A model composed of a generator (like a VAE decoder) and a discriminator that compete to produce increasingly realistic images and to better distinguish real from fake.
Transformer: An encoder-decoder architecture centered on attention, introduced in 2017, foundational for most modern language and many vision models.
Variational autoencoder (VAE): A generative model whose encoder maps inputs to a probability distribution (mean and variance) in the latent space, forcing the latent space to have a structured, typically Gaussian, form.
KL divergence loss: A term in the VAE loss function that regularizes the encoder's latent distribution to be close to a fixed prior distribution (e.g., standard normal), helping to structure the latent space.
RGB: A common color representation system using red, green, and blue intensities for each pixel in an image.
Bottleneck: The lower-dimensional latent space in an autoencoder, where the model is forced to compress information while preserving essential details.
Vision Transformer (ViT): A model that applies the transformer architecture to images by learning embeddings on image patches instead of text tokens.