Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 5 - Architectures
Key Moments
The Diffusion-Transformer (DiT) architecture leverages self-attention to process image patches as tokens, enabling global understanding and local detail preservation. However, its method of injecting conditioning information via adaptive layer normalization limits nuanced control over different image regions, prompting the development of multimodal diffusion transformers (MMDiTs).
Key Insights
The U-Net architecture, with its encoder-decoder structure and skip connections, preserves local details while achieving a global receptive field, making it a foundational model for image generation tasks such as DDPM and Latent Diffusion Models.
Diffusion Transformer (DiT) models process image latents by dividing them into patches, treating each as a token and employing self-attention to capture long-range dependencies, overcoming limitations of convolutional architectures in understanding global image structure.
Adaptive Layer Normalization is a key method in DiTs for injecting conditioning information (time step, class labels) by modulating patch embeddings, showing superior performance in image generation quality compared to cross-attention or concatenation methods.
Rotary Position Embeddings (RoPE) offer a more effective way to inject positional information by rotating query and key vectors within the attention mechanism, leading to improved performance and interpretability compared to absolute position embeddings.
Multimodal Diffusion Transformers (MMDiTs), including models like Stable Diffusion 3, enhance conditioning by employing joint attention mechanisms, allowing image patches and text embeddings to interact more dynamically and improving control over generated content.
The effectiveness of DiT architectures scales with both transformer size and patch granularity, with the best results achieved by scaling both dimensions simultaneously, as measured by metrics like FID (where lower is better).
Foundational architectures: U-Net for image generation
The lecture begins by recapping the foundations of image generation, including diffusion, score matching, and flow matching, and the importance of representing images in a meaningful latent space, explored via Variational Autoencoders (VAEs). The U-Net architecture is introduced as a key model for image generation, designed to capture both global structure and local details. Its encoder-decoder structure with skip connections allows for downsampling to increase the receptive field and capture global context, followed by upsampling to reconstruct the image. The skip connections are crucial for transferring fine-grained local details from the encoding path to the decoding path, mitigating the loss of information during downsampling. This architecture has been instrumental in models like DDPM and Latent Diffusion Models, enabling them to generate high-resolution images. The input to these models is typically a noisy image (x_t), a time step (t), and a condition (c), with the output being a prediction (e.g., noise, velocity) used to denoise the image iteratively. The U-Net's ability to output a prediction of the same dimension as the input is essential for the iterative denoising process.
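To make the encoder-decoder-with-skips idea concrete, here is a minimal sketch in PyTorch. It is an illustrative toy, not the lecture's model: real diffusion U-Nets also take the time step and condition as inputs, stack many residual blocks per resolution, and often include attention layers; the channel counts and single down/up level here are assumptions for brevity.

```python
# Minimal U-Net sketch (illustrative toy, not the lecture's exact model).
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.enc1 = nn.Conv2d(3, ch, 3, padding=1)                   # full resolution
        self.down = nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1)    # downsample: larger receptive field
        self.mid = nn.Conv2d(ch * 2, ch * 2, 3, padding=1)           # global context at low resolution
        self.up = nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1)  # upsample back
        # Skip connection doubles the channels: the decoder sees both
        # global (upsampled) and local (skipped) features.
        self.dec1 = nn.Conv2d(ch * 2, 3, 3, padding=1)               # output matches input shape
        self.act = nn.SiLU()

    def forward(self, x):
        h1 = self.act(self.enc1(x))       # local details, kept for the skip
        h2 = self.act(self.down(h1))
        h2 = self.act(self.mid(h2))
        h3 = self.up(h2)
        h = torch.cat([h3, h1], dim=1)    # skip connection restores fine detail
        return self.dec1(h)               # same dimension as input: usable for iterative denoising

x = torch.randn(1, 3, 32, 32)
print(TinyUNet()(x).shape)  # torch.Size([1, 3, 32, 32])
```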
Representing time steps and conditions
To effectively inform the generation model, external signals like the time step 't' and conditioning information 'c' must be adequately represented. Time steps are often represented as high-dimensional vectors using sinusoidal functions, similar to a Fourier series, where different dimensions capture oscillations at different frequencies. This mimics how humans perceive time through units like seconds, minutes, and hours, which vary at different rates. For conditioning, which can be text prompts or other modalities, various methods exist. These include learning embeddings for predefined classes, using token embeddings from pre-trained large language models (LLMs), or extracting embeddings, such as the CLS token, from a Vision Transformer (ViT). These representations are then injected into the model, with common methods including feature map modulation, cross-attention, or simple concatenation.
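Below is a minimal sketch of the sinusoidal time-step embedding described above, following the standard Transformer/DDPM recipe; the dimension (128) and frequency base (10000) are conventional choices, not values from the lecture.

```python
# Sinusoidal time-step embedding: each pair of dimensions oscillates
# at a different frequency, like seconds/minutes/hours for humans.
import math
import torch

def timestep_embedding(t: torch.Tensor, d: int = 128) -> torch.Tensor:
    """Map integer time steps t (shape [B]) to vectors of shape [B, d]."""
    half = d // 2
    # Geometrically spaced frequencies, from fast to slow oscillations.
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t[:, None].float() * freqs[None, :]          # [B, half]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

emb = timestep_embedding(torch.tensor([0, 10, 999]))
print(emb.shape)  # torch.Size([3, 128])
```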
The Diffusion Transformer (DiT): Applying self-attention to images
Inspired by the success of Transformers in natural language processing (NLP) and Vision Transformers (ViT) in image understanding, the lecture introduces the Diffusion Transformer (DiT). This architecture adapts the Transformer's self-attention mechanism for image generation. Instead of processing sequences of words, DiT treats an image's latent representation as a grid of patches, each considered a 'token'. Self-attention then allows each patch to interact with all other patches, enabling a powerful understanding of global structure and long-range dependencies that convolutional U-Nets might struggle with. This global interaction capability is particularly beneficial for tasks requiring a holistic understanding of an image, such as generating complex scenes or ensuring consistency across distant image regions. Conditioning information (time step and class labels) is injected, with adaptive layer normalization (AdaLN) emerging as a highly effective method. AdaLN modulates the patch embeddings based on the conditioning signals, allowing for fine-grained control over the generation process and achieving superior image quality compared to other injection methods like cross-attention.
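A sketch of the DiT front end follows: a strided convolution is one common way to cut a latent into non-overlapping patches and project each to a token; the latent shape, patch size, and token dimension below are illustrative assumptions.

```python
# Patchifying a latent into tokens for self-attention (DiT-style front end).
import torch
import torch.nn as nn

latent = torch.randn(1, 4, 32, 32)   # e.g. a VAE latent: [B, C, H, W]
p, d = 2, 768                        # patch size, token dimension (assumptions)
# A strided conv cuts the latent into non-overlapping p x p patches
# and linearly projects each one to a d-dimensional token.
patchify = nn.Conv2d(4, d, kernel_size=p, stride=p)
tokens = patchify(latent).flatten(2).transpose(1, 2)   # [B, (H/p)*(W/p), d]
print(tokens.shape)  # torch.Size([1, 256, 768]) -- 256 tokens attend to one another
```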
Injecting conditions with adaptive layer normalization
Adaptive Layer Normalization (AdaLN) is a sophisticated technique used in DiTs to integrate conditioning information. It operates by taking the time step and class label embeddings, processing them through a Multi-Layer Perceptron (MLP) to generate 'gate', 'scale', and 'shift' parameters (alpha, gamma, beta). These parameters are then used to modulate the patch embeddings within the transformer blocks. Specifically, alpha acts as a gate, controlling how much of the transformation passes through; gamma scales the embedding, highlighting or downplaying certain features; and beta shifts the embedding. This modulation allows the model to prioritize specific learned dimensions within the patch embeddings based on the input condition. For example, when generating a 'brown fluffy teddy bear,' AdaLN can increase the intensity of dimensions related to 'brownness' and 'fluffiness' at early stages of generation (high noise, low 't') and focus on finer texture details at later stages (low noise, high 't'). This approach proved more effective than methods like cross-attention or simple concatenation, leading to better image quality, as indicated by lower FID scores.
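Here is a minimal sketch of AdaLN-style modulation matching the description above. The actual DiT block modulates the attention and MLP sublayers separately (and, in the adaLN-Zero variant, initializes the gate at zero); this simplified version shows one gated, scaled, and shifted sublayer, with illustrative dimensions.

```python
# AdaLN-style gate/scale/shift modulation (simplified single-sublayer sketch).
import torch
import torch.nn as nn

d = 768
norm = nn.LayerNorm(d, elementwise_affine=False)
# An MLP maps the conditioning embedding (time step + class) to the three
# modulation parameters: gate alpha, scale gamma, shift beta.
to_mod = nn.Sequential(nn.SiLU(), nn.Linear(d, 3 * d))

def adaln_block(x, cond, sublayer):
    """x: [B, N, d] patch tokens; cond: [B, d] conditioning embedding."""
    alpha, gamma, beta = to_mod(cond).chunk(3, dim=-1)      # each [B, d]
    h = norm(x) * (1 + gamma[:, None]) + beta[:, None]      # scale & shift per condition
    return x + alpha[:, None] * sublayer(h)                 # gated residual update

x, cond = torch.randn(1, 256, d), torch.randn(1, d)
out = adaln_block(x, cond, sublayer=nn.Linear(d, d))
print(out.shape)  # torch.Size([1, 256, 768])
```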
Limitations of DiT and the rise of Multimodal Diffusion Transformers (MMDiTs)
Despite the strengths of DiTs, their approach to injecting conditioning via AdaLN has a key limitation: it applies uniform modulation across all image patches. This is problematic for complex prompts where different spatial regions require distinct conditioning. For instance, generating an image with a 'brown teddy bear' in one area and 'white walls' in another cannot be effectively handled by a single modulation applied universally. This limitation led to the development of Multimodal Diffusion Transformers (MMDiTs). MMDiTs aim to provide more nuanced ways to integrate cross-modal information, moving beyond uniform modulation. Two primary classes of methods have emerged: cross-attention and joint attention. Cross-attention allows image patches (queries) to attend to relevant parts of the text prompt (keys/values) independently. Joint attention, on the other hand, treats image patch embeddings and text embeddings together, allowing them to mutually attend to each other within the same attention layer, fostering a more integrated understanding across modalities.
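A minimal sketch of the joint-attention idea follows: image and text tokens are concatenated into one sequence so every token attends to every other. For brevity this shares all projection weights across modalities, whereas double-stream MMDiTs keep separate per-modality weights and share only the attention operation itself; all dimensions are illustrative.

```python
# Joint attention: one self-attention layer over concatenated modalities.
import torch
import torch.nn as nn

d, heads = 768, 12
attn = nn.MultiheadAttention(d, heads, batch_first=True)

img_tokens = torch.randn(1, 256, d)   # patch embeddings
txt_tokens = torch.randn(1, 77, d)    # text embeddings (already projected to d)

seq = torch.cat([img_tokens, txt_tokens], dim=1)   # one joint sequence
out, _ = attn(seq, seq, seq)                       # every token attends to every other
img_out, txt_out = out.split([256, 77], dim=1)     # split back per modality
print(img_out.shape, txt_out.shape)
```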
Advancements in positional encoding and 2D representation
A crucial aspect for Transformer-based models, including DiTs and MMDiTs, is handling positional information. The original Transformer used absolute positional embeddings, which were added to the input embeddings. Rotary Position Embeddings (RoPE), introduced in 2021, offer a more advanced approach: each query and key vector is rotated within the attention mechanism by an angle proportional to its position, so that attention scores depend only on the relative position between tokens. This method has shown superior performance and interpretability. For 2D image data, RoPE has been adapted using strategies like axial RoPE (segregating the x and y axes) and mixed RoPE (interleaving x and y rotations). These methods aim to capture spatial relationships effectively. Additionally, techniques like centered coordinates (used in Seedream 2.0) and relative positional embeddings in cross-attention (e.g., diagonal encoding in Qwen Image) help models better understand spatial relationships, especially between different modalities like text and image tokens.
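A minimal 1D RoPE sketch matching the description above: pairs of dimensions in each query and key are rotated by an angle proportional to the token's position, which makes their dot products a function of relative offset only. Axial 2D RoPE would apply the same rotation separately to one half of the vector using x coordinates and to the other half using y coordinates.

```python
# Rotary position embedding (RoPE), 1D "rotate-half" variant.
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: [B, N, d] with d even. Returns position-rotated vectors, same shape."""
    B, N, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half) / half)         # per-pair frequency
    angles = torch.arange(N)[:, None] * freqs[None, :]   # angle grows with position
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # 2D rotation applied to each (x1, x2) pair of dimensions; after rotating
    # both queries and keys this way, q . k depends only on position offsets.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(1, 256, 64)
print(rope(q).shape)  # torch.Size([1, 256, 64])
```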
Scaling and future directions in image generation architectures
The lecture concludes by emphasizing that the field of image generation architectures is highly dynamic, with continuous innovation. The effectiveness of DiT and MMDiT models is strongly linked to their scale, both in terms of the number of parameters (transformer size) and the granularity of patch representation. Research indicates that scaling both dimensions synergistically yields the best results, as measured by metrics like FID. While specific architectures like U-Nets, DiTs, and MMDiTs provide foundational concepts, the exact implementation details and the combination of techniques are constantly evolving. Future architectures will likely continue to build upon self-attention, explore more sophisticated methods for multimodal fusion, and refine positional encoding strategies to achieve even higher fidelity and control in image generation. The ongoing development suggests a trend towards models that can handle increasingly complex prompts and generate highly realistic and diverse visual content.
Common Questions
How does the U-Net architecture work?
The U-Net architecture downsamples its input to capture global image structure and then upsamples it back to the original size, using skip connections to preserve local details, making it effective for tasks like image segmentation and generation.
Mentioned in this video
A model published in 2023 that scaled the U-Net architecture significantly.
The model released in 2024 that first coined the term MMDiT and relies on joint attention.
An example of a state-of-the-art double-stream MMDiT architecture for image generation, released in 2025.
A model that addresses how to make sense of positions when dealing with different numbers of patches, by centering coordinates to help the model locate different parts of the image relative to the center.
A vision generation equivalent of the ViT, introduced in 2022, relying on self-attention to allow all image patches to interact with one another for image generation.
An architecture coined in the Stable Diffusion 3 paper, relying on joint attention of different modalities for image generation, overcoming limitations of DiT.
An example of a single-stream MMDiT architecture, treating all modalities equally, released in 2025.
A hybrid MMDiT model that mixes both single-stream and double-stream layers, released in 2025.
Another strategy mentioned to combat blurriness in images generated by VAEs.
A metric used to quantify the quality of generated images, where a lower FID value indicates better image quality.
A method to inject conditions where image patch embeddings (queries) attend to text embeddings (keys and values) to determine relevance for image changes.
A common loss function used in diffusion models to predict the noise to be removed, expressed as an L2 regression loss.
A method for injecting external signals (time step, class label) into the Diffusion Transformer by modulating patch embeddings via learned gate, scale, and shift coefficients, found to be the most performant.
A leading method for representing positions in attention layers, introduced in a 2021 paper, which rotates queries and keys based on their positions.
A 2D position embedding strategy that mixes rotations with respect to both X and Y axes within the same vector, mitigating issues of lacking interaction between axes seen in Axial RoPE.
A strategy mentioned to combat blurriness in images generated by VAEs.
A 2D position embedding strategy that segregates axes, representing vectors according to their coordinates in X and Y parts of the vector, used in the original DiT paper.
An architecture for representing images in a meaningful latent space, structuring it with constraints on distribution behavior, but can lead to blurry images.
A method to condition the generation process, allowing models to place more emphasis on the input prompt by predicting the noise or velocity twice and combining the two predictions with a guidance factor (see the sketch after this list).
An image generation model shaped like a 'U' that downsamples the input to understand global structure and then upsamples it back to the original dimension while preserving local details through direct connections (skip connections).
Applied the encoder part of the Transformer architecture to image understanding by cutting images into patches and processing them like tokens with self-attention.
A method to inject conditions where both image patch embeddings and text embeddings are considered jointly and attended through the same self-attention layer.
An architecture that revolutionized the NLP field in 2017, based on the concept of self-attention mechanisms, which can also be applied to images.
A numerical method for solving differential equations, used in the iterative process of the DiT to follow the vector field and transition from an initial latent to a target latent.
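As a worked example of the classifier-free guidance entry above, here is a minimal sketch of the "predict twice" recipe; `model`, `cond`, and `null_cond` are hypothetical placeholders for a noise/velocity predictor and its conditional and unconditional inputs, and the default guidance factor is a common convention rather than a value from the lecture.

```python
# Classifier-free guidance: extrapolate from the unconditional prediction
# toward the conditional one by a guidance factor w.
import torch

def guided_prediction(model, x_t, t, cond, null_cond, w: float = 7.5):
    """Returns the guided noise/velocity prediction at time step t."""
    pred_cond = model(x_t, t, cond)         # conditioned on the prompt
    pred_uncond = model(x_t, t, null_cond)  # unconditional (e.g. empty prompt)
    # w = 0 -> unconditional; w = 1 -> conditional; w > 1 -> over-emphasize prompt
    return pred_uncond + w * (pred_cond - pred_uncond)
```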