Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 6 - Model Training

Stanford Online
Education | 5 min read | 101 min video
May 19, 2026
TL;DR

Training diffusion models for text-to-image generation is a multi-stage process that runs from pre-training through post-training and tuning to distillation, and newer techniques such as REPA are reported to speed up training by as much as 18x.

Key Insights

1. The default loss function for diffusion models is shifting towards flow matching.
2. Sampling timestep 't' from a logit-normal distribution, rather than uniformly, emphasizes middle steps where denoising is hardest.
3. Representation Alignment (REPA) can speed up diffusion transformer training by an order of magnitude (e.g., 18x) by matching pre-trained encoder representations.
4. Post-training can involve continued training for knowledge expansion or supervised fine-tuning for behavioral improvements like aesthetics and prompt adherence.
5. Distillation techniques, such as progressive distillation and InstaFlow, aim to reduce the number of inference steps for faster image generation, potentially from thousands to just one.
6. Consistency models and GAN-like objectives offer alternative training approaches that can lead to crisper image generation and more stable training dynamics.

The multi-stage training lifecycle of text-to-image models

The training process for text-to-image generation models is broken down into distinct phases. It begins with **pre-training**, a compute-intensive stage where the model learns to generate general images using vast datasets. This is followed by **post-training**, which refines the model's output and can include **continued training** to expand its knowledge base or **supervised fine-tuning** to improve specific behaviors like aesthetic quality or prompt adherence. An optional **tuning** phase allows for personalization, such as generating images of a specific subject. Finally, **distillation** methods are employed to make the trained model more efficient for production by reducing inference time and computational cost.

Optimizing training with flow matching and timestep sampling

**Flow matching**, which frames image generation as a transport problem, is increasingly the default loss for training diffusion models. A crucial training choice is how to sample the noise level, or 'timestep' (t). Instead of a uniform distribution, practical training often uses a **logit-normal distribution**. The motivation is that the task is easiest at the extremes of the noise spectrum (very noisy or nearly clean) and hardest in the middle. Sampling more often from the middle timesteps, where denoising is most challenging, concentrates training on the hardest part of the generation process.
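As a rough illustration of the two ideas in this section, the sketch below draws timesteps from a logit-normal distribution and builds one flow matching training pair. The function names, the mean and standard deviation of 0 and 1, and the convention that t = 0 is the clean image and t = 1 is pure noise are assumptions made for the example, not details taken from the lecture.

```python
import math
import random


def sample_logit_normal(rng, mean=0.0, std=1.0):
    """Draw t in (0, 1) by squashing a Gaussian sample through a sigmoid."""
    z = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-z))


def flow_matching_pair(x0, noise, t):
    """Return the noisy input x_t and the velocity target for one example.

    Assumed convention: t = 0 is the clean image, t = 1 is pure noise,
    and the path between them is a straight line.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]
    target_velocity = [b - a for a, b in zip(x0, noise)]
    return x_t, target_velocity


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


# Compare how often the sampler lands in the middle half of the range.
rng = random.Random(0)
ts = [sample_logit_normal(rng) for _ in range(10_000)]
middle_fraction = sum(0.25 < t < 0.75 for t in ts) / len(ts)
```

A uniform sampler would put about half of its draws in the interval (0.25, 0.75); the logit-normal sampler puts noticeably more there, which is the emphasis on middle noise levels described above.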

Enhancing training efficiency with representation alignment

The **Representation Alignment (REPA)** method offers a significant speed-up in training diffusion transformers. It adds a loss term that encourages the representations learned by the diffusion model's layers to align with those of a pre-trained encoder. The effect is like handing a learner a 'book': the encoder guides the model so that it learns faster and more effectively. Studies show REPA can accelerate training by as much as 18 times, with benefits more pronounced in earlier layers of the transformer and for larger models. This is particularly valuable given the high computational cost of training such models.
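The combined objective can be sketched as a denoising loss plus a weighted alignment term. In this minimal example the alignment term is one minus the mean cosine similarity between per-patch features; the function names, the weight of 0.5, and the assumption that the diffusion model's hidden states have already been projected to the encoder's dimension are all illustrative choices, not details confirmed by the source.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(projected_hidden, encoder_features):
    """One minus the mean per-patch cosine similarity (0 when aligned)."""
    sims = [
        cosine_similarity(h, e)
        for h, e in zip(projected_hidden, encoder_features)
    ]
    return 1.0 - sum(sims) / len(sims)


def total_loss(denoising_loss, projected_hidden, encoder_features, weight=0.5):
    """Denoising loss plus the weighted alignment term (weight is assumed)."""
    return denoising_loss + weight * alignment_loss(
        projected_hidden, encoder_features
    )
```

The pre-trained encoder is kept frozen in this picture: only the diffusion model (and its projection head) would receive gradients from the alignment term.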

Post-training strategies for improved image quality and control

Beyond initial pre-training, **post-training** focuses on refining the model's capabilities. **Continued training** exposes the model to a specific dataset (e.g., teddy bears) to specialize its knowledge. **Supervised fine-tuning** shifts the focus from knowledge to behavior, aiming to enhance aesthetics, lighting, or text adherence. A more advanced category is **preference tuning**, which uses human feedback (or model-generated scores) to teach the model what constitutes a 'good' image. Methods like **Reward Feedback Learning** train a reward model to score images and then guide the diffusion model to produce higher-scoring outputs. **Flow-GRPO** (Group Relative Policy Optimization adapted to flow models) and **Diffusion-DPO** are other approaches that use feedback to align model outputs with human preferences, although they carry risks like reward hacking. **Prompt enhancement** techniques are also crucial: they transform simple user prompts into detailed descriptions that make better use of the model's capabilities.
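To make the reward-model step concrete, here is a minimal sketch of a pairwise loss in the Bradley-Terry form, which the glossary below lists as one formulation for training reward models. The function name is hypothetical, and the scalar rewards stand in for the outputs of a reward model scoring a preferred and a rejected image.

```python
import math


def bradley_terry_loss(reward_preferred, reward_rejected):
    """Negative log-probability that the preferred image wins.

    Computes -log(sigmoid(r_preferred - r_rejected)) in a numerically
    stable way by branching on the sign of the difference.
    """
    diff = reward_preferred - reward_rejected
    if diff > 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))
```

The loss shrinks as the reward model scores the preferred image further above the rejected one, and it equals log 2 when the two scores are tied.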

Personalization through fine-tuning: DreamBooth and LoRA

For specific use cases, **tuning** allows personalization. **DreamBooth** is a popular method for generating images of a specific subject (e.g., a unique teddy bear). It trains the model on a few images of the subject, associated with a unique, rare token. A critical challenge is preventing the model from forgetting its general capabilities; this is addressed by adding a **prior preservation loss**. To manage the significant computational cost of updating large models, techniques like **Low-Rank Adaptation (LoRA)** are employed. LoRA trains only a fraction of the model's weights by introducing low-rank matrices, significantly reducing training time and cost while preserving performance. The remaining trade-off is that each subject still needs its own training run, so cost and time add up when many subjects must be personalized.
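The low-rank idea can be sketched in a few lines: the frozen base weight W is combined with the product of two small matrices B and A, and only B and A are trained. The helper names and the alpha/rank scaling convention below are assumptions for illustration (the scaling is a common choice, not something the source specifies).

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


def lora_weight(base, lora_b, lora_a, alpha, rank):
    """Effective weight W + (alpha / rank) * B @ A; base stays frozen."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base, delta)
    ]


def trainable_params(d_out, d_in, rank):
    """Parameter count of B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)
```

For a 1024 x 1024 layer with rank 8, `trainable_params(1024, 1024, 8)` gives 16,384 trainable values against 1,048,576 in the full matrix, which is where the cost reduction comes from. Initializing B to zeros makes the adapted layer start out identical to the base layer.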

Distillation techniques for efficient inference

To make diffusion models practical for high-volume or real-time applications, **distillation** methods are essential. These techniques aim to preserve high-quality outputs while drastically reducing the number of inference steps. Traditional distillation trains a smaller 'student' model to mimic a larger 'teacher' model. For diffusion models, the goal is more often to reduce the number of sampling steps, in the extreme from thousands to one. **Progressive distillation** is one approach: the number of steps is halved iteratively, so the student learns from a sequence of manageable sub-problems. **InstaFlow** builds on rectified flow concepts, combining distillation with straighter paths for faster generation. **Consistency models** offer another avenue, enforcing that different noisy versions of an image all map to the same clean output, often trained with a teacher-student setup.
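The halving procedure in progressive distillation can be sketched as follows. The first function lists the step counts visited across rounds; the second computes the target a student would be trained to reach in a single step, namely the result of two consecutive teacher steps. Using plain Euler steps on a scalar state is an assumption made to keep the example small, and `teacher_velocity` is a hypothetical callable.

```python
def halving_schedule(teacher_steps):
    """Step counts visited when each round halves the sampler length."""
    if teacher_steps < 1 or teacher_steps & (teacher_steps - 1):
        raise ValueError("teacher_steps must be a positive power of two")
    schedule = [teacher_steps]
    while schedule[-1] > 1:
        schedule.append(schedule[-1] // 2)
    return schedule


def two_teacher_steps(x, t, dt, teacher_velocity):
    """Target for one student step of size dt: two teacher half-steps.

    Assumption: the teacher is advanced with plain Euler steps along
    the sampler's direction of time.
    """
    half = 0.5 * dt
    mid = x + half * teacher_velocity(x, t)
    return mid + half * teacher_velocity(mid, t + half)
```

Starting from a 1024-step teacher, `halving_schedule(1024)` has 11 entries, so ten rounds of distillation are needed to reach a single-step student.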

Advanced distillation for enhanced quality and stability

Further advancements in distillation move beyond simple regression losses like Mean Squared Error (MSE) to improve image quality. Using **Learned Perceptual Image Patch Similarity (LPIPS)** as the loss can lead to crisper results because it compares feature maps from a pre-trained network rather than raw pixel values. Techniques inspired by Generative Adversarial Networks (GANs) introduce adversarial objectives to produce sharper images. A distribution loss, often framed using KL divergence, compares the distribution of generated images to the target distribution, encouraging the student model to mimic the teacher's output more closely. These methods, whether they use GAN-like losses or distribution-based objectives, aim to stabilize training and enhance the visual fidelity of the generated images, often operating in latent space to reduce computational overhead.
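As a small worked example of the quantity behind a distribution loss, the sketch below computes the KL divergence between two discrete distributions. Real distillation objectives work with distributions over images and estimate this quantity indirectly; the discrete version and the function name are simplifying assumptions for illustration only.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes.

    Terms with p_i == 0 contribute nothing; q_i must be positive
    wherever p_i is positive.
    """
    return sum(
        pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0
    )
```

The divergence is zero only when the two distributions match, and it is not symmetric, so which distribution plays the role of p (teacher or student) is a design choice.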

Text-to-Image Model Training Best Practices

Practical takeaways from this episode

Do This

Use Flow Matching loss as the default for training due to its efficiency.
Sample time steps from a Logit-Normal distribution, emphasizing middle noise levels for better learning.
Apply Time Step Shifting based on image resolution to account for perceived noise differences.
Leverage Representation Alignment (REPA) with pre-trained encoders to significantly speed up training.
Implement Curriculum Learning: start training on easy (low-resolution, simple prompt) images before moving to hard (high-resolution, complex prompt) ones.
Utilize Continued Training (CT) to expand model knowledge on specific domains (e.g., teddy bears).
Employ Supervised Fine-tuning (SFT) to improve model behavior, such as aesthetics or instruction following.
Implement Preference Tuning methods like Reward Feedback Learning or Flow-GRPO to align model outputs with human preferences.
Use Prompt Enhancement to transform simple user inputs into detailed prompts, leveraging the model's full capabilities.
Personalize models with DreamBooth for specific subjects, using rare tokens and prior preservation loss to prevent overfitting.
Use Low-Rank Adaptation (LoRA) for DreamBooth to drastically reduce training costs and computational burden.
Investigate distillation techniques (Progressive Distillation, InstaFlow, Consistency Models) for efficient inference and lower latency.
For image quality assessment in distillation, prefer LPIPS (Learned Perceptual Image Patch Similarity) over Mean Squared Error (MSE).
Consider GAN-like adversarial objectives (e.g., ADD) in distillation to achieve crisper image predictions.
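The time step shifting item in the list above can be illustrated with one commonly used formula, which remaps a timestep according to the ratio of pixel counts between a base and a target resolution. The source does not state which formula the lecture uses, so the expression below, the function name, and the convention that t = 1 is pure noise are assumptions.

```python
import math


def shift_timestep(t, base_pixels, target_pixels):
    """Remap t in [0, 1] so larger images spend more time at high noise.

    Assumed form: ratio * t / (1 + (ratio - 1) * t), with
    ratio = sqrt(target_pixels / base_pixels) and t = 1 as pure noise.
    """
    ratio = math.sqrt(target_pixels / base_pixels)
    return ratio * t / (1.0 + (ratio - 1.0) * t)
```

With this form, the endpoints 0 and 1 stay fixed, equal resolutions leave t unchanged, and moving from 256 x 256 to 512 x 512 maps t = 0.5 to roughly 0.67. The intuition is that the same noise level is perceived as weaker on a higher-resolution image, so the schedule is pushed toward noisier timesteps.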

Avoid This

Do not sample time steps from a uniform distribution, as it equally weights easy and hard learning tasks.
Avoid training DreamBooth without prior preservation loss, as it leads to overfitting and forgetting previous capabilities.
Do not directly reduce student model size to achieve distillation efficiency without other techniques, as it often leads to much worse quality.
Avoid excessive reflow steps in InstaFlow, as it can introduce discretization errors in generated pairs.
Do not backpropagate through both student models simultaneously in consistency training, to prevent collapse and ensure stable updates.
Do not rely solely on MSE loss for image quality evaluation, as it doesn't capture perceptual similarity well.
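The consistency-training item in the list above can be sketched with a toy scalar model f(x) = w * x. Only the online weight receives a gradient; the target branch is treated as a constant and then updated as an exponential moving average of the online weight. The EMA target, the learning rate, and the decay value are assumptions chosen for illustration, and the function name is hypothetical.

```python
def consistency_step(online, target, x_noisier, x_cleaner, lr=0.1, decay=0.9):
    """One toy update for a scalar model f(x) = w * x.

    The loss is (online * x_noisier - target * x_cleaner) ** 2, and the
    target prediction is held fixed, so no gradient flows through it.
    """
    pred = online * x_noisier
    goal = target * x_cleaner  # treated as a constant
    grad_online = 2.0 * (pred - goal) * x_noisier
    new_online = online - lr * grad_online
    # The target weight slowly tracks the online weight instead of
    # being optimized directly.
    new_target = decay * target + (1.0 - decay) * new_online
    return new_online, new_target
```

If both branches were optimized at once, the loss could be driven to zero by collapsing both predictions to a constant; holding one branch fixed removes that shortcut.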

Common Questions

Text-to-image generation models typically involve a backbone like the Diffusion Transformer (DiT) for denoising noisy latents, a VAE for operating in a lower-dimensional latent space, and embedding models to inject conditions. The U-Net was dominant until 2022, but DiT-based models are now more prevalent. For example, Multimodal Diffusion Transformers (MMDiT) handle text as a standalone modality.

Topics

Mentioned in this video

Concepts
U-Net

An image generation architecture that was dominant until 2022, composed of downsampling and upsampling phases joined by skip ('copy and crop') connections to capture global and local details.

Multimodal Diffusion Transformer

A variant of the Diffusion Transformer that treats input text as a standalone modality, injected directly rather than as an afterthought via modulation.

Variational Auto-Encoder

A model involved in image generation that provides a latent space for operations, lowering the computational burden of training and generation processes.

KL Divergence

A measure of how one probability distribution diverges from a second, expected probability distribution, used in distillation techniques, particularly in LLM contexts.

LPIPS

A loss formulation used in InstaFlow to compare image quality by passing images through a pre-trained model with frozen weights and comparing the L2 distance between feature maps, offering a more meaningful interpretation of visual quality.

Score-based Model

A second perspective on image generation that estimates the score, which acts as a compass pointing toward clean images, often framed with stochastic differential equations.

Rectified Flow

A technique that helps in straightening paths in the latent space, making it possible to take fewer Euler steps to reach the answer in image generation.

Logit-Normal Distribution

A distribution used to sample time steps for training, emphasizing middle noise levels more than early or late steps, improving model learning efficiency for hard tasks.

Reward Feedback Learning

A preference tuning method that trains a reward model from human preferences (pair-wise or list-wise ratings) and then tunes the image generation model to produce images with higher rewards.

Low-Rank Adaptation

A method commonly used in LLMs and now in image generation, which trains only a fraction of model weights by considering them as a base model's weights plus two low-rank matrices, preserving performance while reducing training cost.

Progressive Distillation

A distillation technique where a student model is trained by progressively working on problems of approximately constant difficulty, halving the number of steps at each iteration to reach a single-step generation.

CME296

The course this lecture is part of, focusing on practical aspects of deep learning models.

Representation Alignment

A method published in 2024 that couples the model's loss with a loss reflecting similarity between internal diffusion transformer representations and pre-trained encoder representations, significantly speeding up training.

Continued Training

A post-training method focused on increasing a model's knowledge by training it on a specific dataset related to a task of interest, such as generating teddy bear images.

CME 295

A sister class that covers topics like distilling knowledge for next token prediction in LLMs, relevant to reward feedback learning.

Flow-GRPO (Group Relative Policy Optimization for flow models)

An algorithm adapted from the LLM field for image generation, which samples a diverse group of images for a given prompt and uses their relative rewards to update the model policy.
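The relative rewards mentioned in this entry are usually obtained by normalizing rewards within the group of images sampled for one prompt. The sketch below shows that normalization only; the function name and the small epsilon are assumptions, and the policy update that consumes these advantages is omitted.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's group of samples.

    Each advantage is (reward - group mean) / (group std + eps), so
    images are judged against their siblings rather than on an
    absolute scale.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Images that score above the group average get a positive advantage and are reinforced; those below average get a negative one. If every image in the group receives the same reward, all advantages are zero and the group provides no learning signal.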

Prompt Enhancement

A method that takes a simple user input and transforms it into a detailed, elegant prompt, matching the kind of prompts a model was extensively trained on to generate high-quality images.

Mean Squared Error

A simple distance metric often used in warm-up phases of distillation, but not optimal for image quality as it focuses on pixel values rather than meaningful features.

Generative Adversarial Networks

A class of AI models mentioned for their ability to produce crisp images, suggesting the potential for adversarial objectives to improve image quality in distillation.

Curriculum Learning

A concept in machine learning that involves teaching a model in a progressive manner, starting with easier tasks (e.g., low-resolution images) before moving to harder ones.

Bradley-Terry model

A formulation used for pair-wise comparison losses, applicable in training reward models for preference tuning.

Diffusion Transformer

An architecture published in late 2022, relying on self-attention to form direct connections between different local parts of an image, mitigating U-Net's failure cases.

Adaptive Layer Normalization

A technique used in Diffusion Transformers to inject conditions and time steps as inputs, modulating token embeddings via gate, shift, and scale factors.
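A minimal sketch of how shift, scale, and gate factors might modulate a single token embedding is shown below. In a real model the three vectors would be predicted from the condition and timestep embeddings; here they are passed in directly, and the function names and the exact form `normalized * (1 + scale) + shift` are assumptions for illustration.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize one token embedding to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln_block(x, shift, scale, gate, sublayer):
    """Apply a sublayer with condition-dependent modulation.

    The normalized token is scaled and shifted before the sublayer,
    and the sublayer output is gated before the residual addition.
    """
    modulated = [
        n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)
    ]
    update = sublayer(modulated)
    return [xi + g * u for xi, g, u in zip(x, gate, update)]
```

With the gate set to zero the block returns its input unchanged, which is why zero-initializing the gate is a convenient way to start training from an identity mapping.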

Flow matching

A third perspective that interprets the problem of going from noise to a target distribution as a transport problem, focusing on the vector field, whose ODE can be solved numerically with a solver such as the Euler method. It is now widely preferred as the default training loss.
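Sampling with the Euler method can be sketched as repeatedly stepping along the vector field. The example uses a scalar state and the assumed convention that t = 1 is noise and t = 0 is data; `velocity` is a hypothetical callable standing in for the trained model.

```python
def euler_sample(x_start, velocity, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 1 (noise) to t = 0 (data).

    Uses fixed-size Euler steps; x_start is the initial noise sample.
    """
    x = x_start
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        x = x - dt * velocity(x, t)
    return x
```

When the field is constant along the path (a perfectly straight trajectory), a single Euler step already lands on the exact endpoint, for example `euler_sample(3.0, lambda x, t: 2.0, 1)`. For a curved field such as `lambda x, t: x`, many steps are needed to approach the true solution, which is the motivation for straightening paths before reducing the step count.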

Supervised Fine-tuning

A post-training method focused on improving a model's behavior, such as generating aesthetically pleasing images or better following text instructions.

Consistency Models

A distillation technique with the property of creating deterministic paths where every point on the path maps to the end goal, enabling early jumps from noise to a clean image.
