Key Moments
Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 6 - Model Training
Training diffusion models for text-to-image generation involves a multi-stage process from pre-training to fine-tuning and distillation, with newer techniques like REPA speeding up training by 18x.
Key Insights
The default loss function for diffusion models is shifting towards flow matching.
Sampling timestep 't' from a logit normal distribution, rather than uniformly, emphasizes middle steps where denoising is hardest.
Representation Alignment (REPA) can speed up diffusion transformer training by an order of magnitude (e.g., 18x) by aligning the model's internal representations with those of a pre-trained encoder.
Post-training can involve continued training for knowledge expansion or supervised fine-tuning for behavioral improvements like aesthetics and prompt adherence.
Distillation techniques, such as progressive distillation and InstaFlow, aim to reduce the number of inference steps for faster image generation, potentially from thousands of steps to just one.
Consistency models and GAN-like objectives offer alternative training approaches that can lead to crisper image generation and more stable training dynamics.
The multi-stage training lifecycle of text-to-image models
The training process for text-to-image generation models is broken down into distinct phases. It begins with **pre-training**, a compute-intensive stage where the model learns to generate general images using vast datasets. This is followed by **post-training**, which refines the model's output and can include **continued training** to expand its knowledge base or **supervised fine-tuning** to improve specific behaviors like aesthetic quality or prompt adherence. An optional **tuning** phase allows for personalization, such as generating images of a specific subject. Finally, **distillation** methods are employed to make the trained model more efficient for production by reducing inference time and computational cost.
Optimizing training with flow matching and timestep sampling
The default loss for training diffusion models is increasingly the **flow matching** loss, which frames image generation as a transport problem from noise to data. A crucial training choice is how to sample the noise level, or timestep, t. Instead of a uniform distribution, practical training often uses a **logit-normal distribution**. The denoising task is easiest at the extremes of the noise spectrum (very noisy or nearly clean) and hardest in the middle, so sampling the middle timesteps more often concentrates training on the noise levels where the model has the most to learn.
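As a rough illustration of the two ideas above, the following sketch draws timesteps from a logit-normal distribution and builds one flow matching training pair. The function names, the `mean` and `std` defaults, and the convention that t = 0 is clean data and t = 1 is pure noise are assumptions made for this example, not details taken from the lecture.

```python
import math
import random


def sample_logit_normal(rng, mean=0.0, std=1.0):
    """Draw t in (0, 1) by squashing a Gaussian sample through a sigmoid."""
    z = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-z))


def flow_matching_pair(x_data, noise, t):
    """Build the noisy input x_t and the velocity target for one sample.

    Assumed convention: x_t = (1 - t) * x_data + t * noise, so the
    velocity along this straight path is noise - x_data.
    """
    x_t = [(1.0 - t) * x + t * n for x, n in zip(x_data, noise)]
    velocity_target = [n - x for x, n in zip(x_data, noise)]
    return x_t, velocity_target


def mse(prediction, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(prediction, target)) / len(target)


rng = random.Random(0)
timesteps = [sample_logit_normal(rng) for _ in range(10_000)]
middle_fraction = sum(0.25 < t < 0.75 for t in timesteps) / len(timesteps)
# A uniform sampler would put about half of its draws in (0.25, 0.75);
# the logit-normal sampler with these assumed defaults puts more there.
print(f"fraction of timesteps in the middle band: {middle_fraction:.2f}")
```

In a real training loop the model would predict the velocity from `x_t`, `t`, and the text condition, and `mse` would be applied to that prediction. Shifting `mean` moves the emphasis toward noisier or cleaner timesteps.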
Enhancing training efficiency with representation alignment
The **Representation Alignment (REPA)** method offers a significant speed-up in training diffusion transformers. It adds a loss term that encourages the representations inside the diffusion model's layers to align with those of a pre-trained encoder. This acts like handing a learner a 'book': the encoder's representations guide the model so that it learns faster and more effectively. REPA is reported to accelerate training by as much as 18 times, with the benefits more pronounced when aligning earlier layers of the transformer and for larger models. This is particularly valuable given the high computational cost of training such models.
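The combined objective described above can be sketched in a few lines. This is a minimal illustration, assuming the diffusion model's hidden states have already been projected to the encoder's feature dimension and that similarity is measured with cosine similarity per token. The function names and the `weight` default are hypothetical choices for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(projected_hidden, encoder_features):
    """Negative mean cosine similarity over tokens; lower means better aligned."""
    similarities = [
        cosine_similarity(h, e)
        for h, e in zip(projected_hidden, encoder_features)
    ]
    return -sum(similarities) / len(similarities)


def total_loss(denoising_loss, projected_hidden, encoder_features, weight=0.5):
    """Denoising loss plus a weighted alignment term (weight is an assumed value)."""
    return denoising_loss + weight * alignment_loss(projected_hidden, encoder_features)
```

The encoder stays frozen in this setup. Only the diffusion model, and any projection applied to its hidden states, would receive gradients from the alignment term.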
Post-training strategies for improved image quality and control
Beyond initial pre-training, **post-training** focuses on refining the model's capabilities. **Continued training** involves exposing the model to a specific dataset (e.g., teddy bears) to specialize its knowledge. **Supervised fine-tuning** shifts the focus from knowledge to behavior, aiming to enhance aesthetics, lighting, or text adherence. A more advanced category is **preference tuning**, which uses human feedback (or model-generated scores) to teach the model what constitutes a 'good' image. Methods like **Reward Feedback Learning** train a reward model to score images, guiding the diffusion model to produce higher-scoring outputs. **Flow Group Relative Policy Optimization (Flow-GRPO)** and **Diffusion-DPO** are other approaches that use feedback to align model outputs with human preferences, although they carry risks like reward hacking. **Prompt enhancement** techniques are also crucial, transforming simple user prompts into detailed descriptions that make better use of the model's capabilities.
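Two building blocks of preference tuning can be sketched briefly: a pairwise loss for training a reward model, and group-relative reward normalization of the kind used by GRPO-style methods. This is a hedged illustration. The logistic (Bradley-Terry style) form of the pairwise loss and all function names are assumptions for this example, and the lecture's exact formulation may differ.

```python
import math
import statistics


def pairwise_preference_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(margin)), computed in a numerically stable way.

    The loss is small when the reward model scores the human-preferred
    image well above the rejected one.
    """
    margin = reward_preferred - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of images generated for one prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Rewards for four images sampled from the same prompt (made-up numbers).
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Images with positive advantage would be reinforced and those with negative advantage discouraged. Because the signal depends entirely on the reward model's scores, a flawed reward model can be exploited, which is the reward hacking risk noted above.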
Personalization through fine-tuning: DreamBooth and LoRA
For specific use cases, **tuning** allows personalization. **DreamBooth** is a popular method for generating images of a specific subject (e.g., a unique teddy bear). It works by training the model on a few images of the subject, associated with a unique, rare token. A critical challenge is preventing the model from forgetting its general capabilities; this is addressed by adding a **prior preservation loss**. To manage the significant computational cost of updating large models, techniques like **Low-Rank Adaptation (LoRA)** are employed. LoRA trains only a fraction of the model's weights by adding low-rank matrices to the frozen base weights, significantly reducing training time and cost while preserving performance. The remaining trade-off is that each new subject still needs its own training run, so cost and time add up when many subjects need to be personalized.
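The low-rank idea behind LoRA can be made concrete with a small sketch. It assumes the effective weight is the frozen base weight plus a scaled product of two small matrices. The helper names, the `scale` parameter, and the layer sizes are illustrative choices, not values from the lecture.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(base, lora_b, lora_a, scale=1.0):
    """Return base + scale * (lora_b @ lora_a); only lora_a and lora_b are trained."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(base_row, delta_row)]
        for base_row, delta_row in zip(base, delta)
    ]


def parameter_counts(d_out, d_in, rank):
    """Compare the size of a full weight update with its low-rank replacement."""
    return d_out * d_in, rank * (d_out + d_in)


full, low_rank = parameter_counts(d_out=1024, d_in=1024, rank=8)
print(f"full update: {full} parameters, LoRA update: {low_rank} parameters")
# prints "full update: 1048576 parameters, LoRA update: 16384 parameters"
```

For this assumed layer size, the low-rank update trains under 2% of the parameters that a full update would. When `lora_b` starts at zero, the effective weight equals the base weight, so training begins from the unmodified model.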
Distillation techniques for efficient inference
To make diffusion models practical for high-volume or real-time applications, **distillation** methods are essential. These techniques aim to preserve high-quality outputs while drastically reducing the number of inference steps. Traditional distillation involves training a smaller 'student' model to mimic a larger 'teacher' model. In diffusion models, this often means reducing the number of sampling steps, in the extreme case from thousands to one. **Progressive distillation** is one approach: the number of steps is halved iteratively, so that in each round the student learns to cover two of its teacher's steps in one, which keeps every round at roughly constant difficulty. **InstaFlow** builds on rectified flow concepts, combining distillation with straighter paths for faster generation. **Consistency models** offer another avenue, enforcing that every point along the same noise-to-image path maps to the same clean output, often trained with a teacher-student setup.
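The step-halving schedule and the value of straighter paths can both be illustrated with a toy sketch. It assumes a one-dimensional "image" and Euler integration from t = 1 (noise) to t = 0 (data). The two velocity fields and all names below are hypothetical stand-ins for a trained model.

```python
import math


def halving_schedule(teacher_steps):
    """Step counts visited by progressive distillation, halving each round."""
    steps = [teacher_steps]
    while steps[-1] > 1:
        steps.append(steps[-1] // 2)
    return steps


def euler_sample(velocity, x_noise, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 1 down to t = 0 with Euler steps."""
    x = x_noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        x = x - dt * velocity(x, t)
    return x


data, noise = 1.0, 3.0


def straight_velocity(x, t):
    """Constant velocity: the path from noise to data is a straight line."""
    return noise - data


def curved_velocity(x, t):
    """Toy curved field dx/dt = x, whose exact endpoint is noise / e."""
    return x


print(halving_schedule(1024))
print(euler_sample(straight_velocity, noise, 1))  # one step lands exactly on data
print(euler_sample(curved_velocity, noise, 1), noise / math.e)  # one step misses badly
```

On the straight path a single Euler step is exact, while the curved field needs many small steps to approach its true endpoint. This is the intuition behind pairing rectified flow with distillation.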
Advanced distillation for enhanced quality and stability
Further advancements in distillation move beyond simple regression losses like Mean Squared Error (MSE) to improve image quality. Using **Learned Perceptual Image Patch Similarity (LPIPS)** can lead to crisper results by comparing feature maps from a pre-trained model instead of raw pixel values. Techniques inspired by Generative Adversarial Networks (GANs) introduce adversarial objectives to produce sharper images. A distribution loss, often framed using KL divergence, compares the distribution of generated images to the target distribution, encouraging the student model to mimic the teacher's output more closely. These methods, whether using GAN-like losses or distribution-based objectives, aim to stabilize training and enhance the visual fidelity of the generated images, often by operating in latent space to reduce computational overhead.
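The difference between a pixel loss, a feature-space loss, and a distribution loss can be shown on toy data. In this sketch, `edge_features` is a hypothetical stand-in for the frozen pre-trained network that a perceptual loss such as LPIPS would use, and the "images" are short lists of numbers. None of these names come from the lecture.

```python
import math


def mse(a, b):
    """Mean squared error between two equally sized signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def edge_features(signal):
    """Toy feature extractor: differences between neighboring values."""
    return [right - left for left, right in zip(signal, signal[1:])]


def feature_distance(a, b, features):
    """Compare two signals in feature space rather than pixel space."""
    return mse(features(a), features(b))


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions, assuming q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


image = [0.0, 1.0, 0.0, 1.0]
brighter = [0.5, 1.5, 0.5, 1.5]  # same structure, uniformly brighter
print(mse(image, brighter))                              # pixel loss penalizes the shift
print(feature_distance(image, brighter, edge_features))  # feature loss sees identical edges
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))             # positive when distributions differ
```

KL divergence is not symmetric, so which distribution plays the role of `p` (teacher or student) is a design choice that changes what the student is penalized for.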
Common Questions
Text-to-image generation models typically involve a backbone like the Diffusion Transformer (DiT) for denoising noisy latents, a VAE for operating in a lower-dimensional latent space, and embedding models to inject conditions. The U-Net was dominant until 2022, but DiT-based models are now more prevalent. For example, Multimodal Diffusion Transformers (MMDiT) handle text as a standalone modality.
Mentioned in this video
**U-Net**: An image generation architecture that was the trend until 2022, composed of downsampling and upsampling phases with copy-and-crop connections to capture global and local details.
**Multimodal Diffusion Transformer (MMDiT)**: A variant of the Diffusion Transformer that treats input text as a standalone modality, injected directly rather than as an afterthought via modulation.
**VAE**: A model that provides a lower-dimensional latent space to operate in, lowering the computational burden of training and generation.
**KL divergence**: A measure of how one probability distribution diverges from a second, reference probability distribution, used in distillation techniques, particularly in LLM contexts.
**LPIPS**: A loss formulation used in InstaFlow that compares images by passing them through a pre-trained model with frozen weights and taking the L2 distance between feature maps, offering a more meaningful measure of visual quality.
**Score matching**: A second perspective on image generation that estimates the score, which acts as a compass to recover clean images, often framed with stochastic differential equations.
**Rectified flow**: A technique that straightens paths in the latent space, making it possible to take fewer Euler steps to reach the answer in image generation.
**Logit-normal distribution**: A distribution used to sample timesteps for training, emphasizing middle noise levels more than early or late steps so that training concentrates on the hardest denoising tasks.
**Reward Feedback Learning**: A preference tuning method that trains a reward model from human preferences (pair-wise or list-wise ratings) and then tunes the image generation model to produce images with higher rewards.
**Low-Rank Adaptation (LoRA)**: A method commonly used in LLMs and now in image generation, which trains only a fraction of model weights by expressing them as a base model's weights plus the product of two low-rank matrices, preserving performance while reducing training cost.
**Progressive distillation**: A distillation technique where a student model is trained progressively on problems of approximately constant difficulty, halving the number of steps at each iteration until it reaches single-step generation.
**CME296**: The course this lecture is part of, focusing on practical aspects of deep learning models.
**Representation Alignment (REPA)**: A method published in 2024 that couples the model's loss with a loss reflecting similarity between internal diffusion transformer representations and pre-trained encoder representations, significantly speeding up training.
**Continued training**: A post-training method focused on increasing a model's knowledge by training it on a specific dataset related to a task of interest, such as generating teddy bear images.
**Sister LLM class**: A sister class that covers topics like distilling knowledge for next-token prediction in LLMs, relevant to reward feedback learning.
**Flow-GRPO**: An algorithm adapted from the LLM field for image generation, aiming to increase diversity of generated images for a given prompt and using relative rewards to update the model policy.
**Prompt enhancement**: A method that takes a simple user input and transforms it into a detailed, elegant prompt, matching the kind of prompts a model was extensively trained on to generate high-quality images.
**Mean Squared Error (MSE)**: A simple distance metric often used in warm-up phases of distillation, but not optimal for image quality as it focuses on pixel values rather than meaningful features.
**Generative Adversarial Networks (GANs)**: A class of AI models mentioned for their ability to produce crisp images, suggesting the potential for adversarial objectives to improve image quality in distillation.
**Curriculum learning**: A concept in machine learning that involves teaching a model in a progressive manner, starting with easier tasks (e.g., low-resolution images) before moving to harder ones.
**Bradley-Terry model**: A formulation used for pair-wise comparison losses, applicable in training reward models for preference tuning.
**Diffusion Transformer (DiT)**: An architecture published in late 2022, relying on self-attention to form direct connections between different local parts of an image, mitigating U-Net's failure cases.
**Modulation**: A technique used in Diffusion Transformers to inject conditions and timesteps as inputs, modulating token embeddings via gate, shift, and scale factors.
**Flow matching**: A third perspective that interprets going from noise to a target distribution as a transport problem, focusing on the vector field, which can be integrated with an ODE solver such as the Euler method. It is now widely preferred for its loss function characteristics.
**Supervised fine-tuning**: A post-training method focused on improving a model's behavior, such as generating aesthetically pleasing images or better following text instructions.
**Consistency models**: A distillation technique with the property of creating deterministic paths where every point on the path maps to the end goal, enabling early jumps from noise to a clean image.
**DreamBooth**: A popular method for personalizing image generation models to produce specific subjects or objects based on a few input images and a rare token.
**DistilBERT**: A distilled version of BERT with approximately 40% fewer parameters that retains 97% of performance, used as an example of successful distillation.
**BERT**: A foundational language model mentioned as an example of distillation in the LLM world, where DistilBERT achieves significant parameter reduction while retaining performance.
**InstaFlow**: A paper that follows up on rectified flow, applying it and adding a layer of distillation to further solidify single-step results in image generation.
**Latent consistency models**: A type of consistency model that operates in the latent space, mentioned as a competitive technique in the literature.