Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 5 - Architectures
Key Moments
The Diffusion-Transformer (DiT) architecture leverages self-attention to process image patches as tokens, enabling global understanding and local detail preservation. However, its method of injecting conditioning information via adaptive layer normalization limits nuanced control over different image regions, prompting the development of multimodal diffusion transformers (MMDiTs).
Key Insights
The U-Net architecture, with its encoder-decoder structure and skip connections, preserves local details while achieving a global receptive field, making it a foundational model for image generation tasks such as DDPM and Latent Diffusion Models.
Diffusion Transformer (DiT) models process image latents by dividing them into patches, treating each as a token and employing self-attention to capture long-range dependencies, overcoming limitations of convolutional architectures in understanding global image structure.
Adaptive Layer Normalization is a key method in DiTs for injecting conditioning information (time step, class labels) by modulating patch embeddings, showing superior performance in image generation quality compared to cross-attention or concatenation methods.
Rotary Position Embeddings (RoPE) offer a more effective way to inject positional information by rotating query and key vectors within the attention mechanism, leading to improved performance and interpretability compared to absolute position embeddings.
Multimodal Diffusion Transformers (MMDiTs), including models like Stable Diffusion 3, enhance conditioning by employing joint attention mechanisms, allowing image patches and text embeddings to interact more dynamically and improving control over generated content.
The effectiveness of DiT architectures scales with both transformer size and patch granularity, with the best results achieved by scaling both dimensions simultaneously, as measured by metrics like FID (where lower is better).
Foundational architectures: U-Net for image generation
The lecture begins by recapping the foundations of image generation, including diffusion, score matching, and flow matching, and the importance of representing images in a meaningful latent space, explored via Variational Autoencoders (VAEs). The U-Net architecture is introduced as a key model for image generation, designed to capture both global structure and local details. Its encoder-decoder structure with skip connections allows for downsampling to increase the receptive field and capture global context, followed by upsampling to reconstruct the image. The skip connections are crucial for transferring fine-grained local details from the encoding path to the decoding path, mitigating the loss of information during downsampling. This architecture has been instrumental in models like DDPM and Latent Diffusion Models, enabling them to generate high-resolution images. The input to these models is typically a noisy image (x_t), a time step (t), and a condition (c), with the output being a prediction (e.g., noise, velocity) used to denoise the image iteratively. The U-Net's ability to output a prediction of the same dimension as the input is essential for the iterative denoising process.
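To make the encoder-decoder-with-skips idea concrete, here is a minimal sketch in PyTorch. It is an illustrative toy, not the lecture's model: real diffusion U-Nets also take the time step and condition as inputs, stack many residual blocks per resolution, and often include attention layers; the channel counts and single down/up level here are assumptions for brevity.

```python
# Minimal U-Net sketch (illustrative toy, not the lecture's exact model).
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.enc1 = nn.Conv2d(3, ch, 3, padding=1)                   # full resolution
        self.down = nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1)    # downsample: larger receptive field
        self.mid = nn.Conv2d(ch * 2, ch * 2, 3, padding=1)           # global context at low resolution
        self.up = nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1)  # upsample back
        # Skip connection doubles the channels: the decoder sees both
        # global (upsampled) and local (skipped) features.
        self.dec1 = nn.Conv2d(ch * 2, 3, 3, padding=1)               # output matches input shape
        self.act = nn.SiLU()

    def forward(self, x):
        h1 = self.act(self.enc1(x))       # local details, kept for the skip
        h2 = self.act(self.down(h1))
        h2 = self.act(self.mid(h2))
        h3 = self.up(h2)
        h = torch.cat([h3, h1], dim=1)    # skip connection restores fine detail
        return self.dec1(h)               # same dimension as input: usable for iterative denoising

x = torch.randn(1, 3, 32, 32)
print(TinyUNet()(x).shape)  # torch.Size([1, 3, 32, 32])
```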
Representing time steps and conditions
To effectively inform the generation model, external signals like the time step 't' and conditioning information 'c' must be adequately represented. Time steps are often represented as high-dimensional vectors using sinusoidal functions, similar to a Fourier series, where different dimensions capture oscillations at different frequencies. This mimics how humans perceive time through units like seconds, minutes, and hours, which vary at different rates. For conditioning, which can be text prompts or other modalities, various methods exist. These include learning embeddings for predefined classes, using token embeddings from pre-trained large language models (LLMs), or extracting embeddings, such as the CLS token, from a Vision Transformer (ViT). These representations are then injected into the model, with common methods including feature map modulation, cross-attention, or simple concatenation.
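Below is a minimal sketch of the sinusoidal time-step embedding described above, following the standard Transformer/DDPM recipe; the dimension (128) and frequency base (10000) are conventional choices, not values from the lecture.

```python
# Sinusoidal time-step embedding: each pair of dimensions oscillates
# at a different frequency, like seconds/minutes/hours for humans.
import math
import torch

def timestep_embedding(t: torch.Tensor, d: int = 128) -> torch.Tensor:
    """Map integer time steps t (shape [B]) to vectors of shape [B, d]."""
    half = d // 2
    # Geometrically spaced frequencies, from fast to slow oscillations.
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t[:, None].float() * freqs[None, :]          # [B, half]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

emb = timestep_embedding(torch.tensor([0, 10, 999]))
print(emb.shape)  # torch.Size([3, 128])
```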
The Diffusion Transformer (DiT): Applying self-attention to images
Inspired by the success of Transformers in natural language processing (NLP) and Vision Transformers (ViT) in image understanding, the lecture introduces the Diffusion Transformer (DiT). This architecture adapts the Transformer's self-attention mechanism for image generation. Instead of processing sequences of words, DiT treats an image's latent representation as a grid of patches, each considered a 'token'. Self-attention then allows each patch to interact with all other patches, enabling a powerful understanding of global structure and long-range dependencies that convolutional U-Nets might struggle with. This global interaction capability is particularly beneficial for tasks requiring a holistic understanding of an image, such as generating complex scenes or ensuring consistency across distant image regions. Conditioning information (time step and class labels) is injected, with adaptive layer normalization (AdaLN) emerging as a highly effective method. AdaLN modulates the patch embeddings based on the conditioning signals, allowing for fine-grained control over the generation process and achieving superior image quality compared to other injection methods like cross-attention.
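A sketch of the DiT front end follows: a strided convolution is one common way to cut a latent into non-overlapping patches and project each to a token; the latent shape, patch size, and token dimension below are illustrative assumptions.

```python
# Patchifying a latent into tokens for self-attention (DiT-style front end).
import torch
import torch.nn as nn

latent = torch.randn(1, 4, 32, 32)   # e.g. a VAE latent: [B, C, H, W]
p, d = 2, 768                        # patch size, token dimension (assumptions)
# A strided conv cuts the latent into non-overlapping p x p patches
# and linearly projects each one to a d-dimensional token.
patchify = nn.Conv2d(4, d, kernel_size=p, stride=p)
tokens = patchify(latent).flatten(2).transpose(1, 2)   # [B, (H/p)*(W/p), d]
print(tokens.shape)  # torch.Size([1, 256, 768]) -- 256 tokens attend to one another
```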
Injecting conditions with adaptive layer normalization
Adaptive Layer Normalization (AdaLN) is a sophisticated technique used in DiTs to integrate conditioning information. It operates by taking the time step and class label embeddings, processing them through a Multi-Layer Perceptron (MLP) to generate 'gate', 'scale', and 'shift' parameters (alpha, gamma, beta). These parameters are then used to modulate the patch embeddings within the transformer blocks. Specifically, alpha acts as a gate, controlling how much of the transformation passes through; gamma scales the embedding, highlighting or downplaying certain features; and beta shifts the embedding. This modulation allows the model to prioritize specific learned dimensions within the patch embeddings based on the input condition. For example, when generating a 'brown fluffy teddy bear,' AdaLN can increase the intensity of dimensions related to 'brownness' and 'fluffiness' at early stages of generation (high noise, low 't') and focus on finer texture details at later stages (low noise, high 't'). This approach proved more effective than methods like cross-attention or simple concatenation, leading to better image quality, as indicated by lower FID scores.
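Here is a minimal sketch of AdaLN-style modulation matching the description above. The actual DiT block modulates the attention and MLP sublayers separately (and, in the adaLN-Zero variant, initializes the gate at zero); this simplified version shows one gated, scaled, and shifted sublayer, with illustrative dimensions.

```python
# AdaLN-style gate/scale/shift modulation (simplified single-sublayer sketch).
import torch
import torch.nn as nn

d = 768
norm = nn.LayerNorm(d, elementwise_affine=False)
# An MLP maps the conditioning embedding (time step + class) to the three
# modulation parameters: gate alpha, scale gamma, shift beta.
to_mod = nn.Sequential(nn.SiLU(), nn.Linear(d, 3 * d))

def adaln_block(x, cond, sublayer):
    """x: [B, N, d] patch tokens; cond: [B, d] conditioning embedding."""
    alpha, gamma, beta = to_mod(cond).chunk(3, dim=-1)      # each [B, d]
    h = norm(x) * (1 + gamma[:, None]) + beta[:, None]      # scale & shift per condition
    return x + alpha[:, None] * sublayer(h)                 # gated residual update

x, cond = torch.randn(1, 256, d), torch.randn(1, d)
out = adaln_block(x, cond, sublayer=nn.Linear(d, d))
print(out.shape)  # torch.Size([1, 256, 768])
```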
Limitations of DiT and the rise of Multimodal Diffusion Transformers (MMDiTs)
Despite the strengths of DiTs, their approach to injecting conditioning via AdaLN has a key limitation: it applies uniform modulation across all image patches. This is problematic for complex prompts where different spatial regions require distinct conditioning. For instance, generating an image with a 'brown teddy bear' in one area and 'white walls' in another cannot be effectively handled by a single modulation applied universally. This limitation led to the development of Multimodal Diffusion Transformers (MMDiTs). MMDiTs aim to provide more nuanced ways to integrate cross-modal information, moving beyond uniform modulation. Two primary classes of methods have emerged: cross-attention and joint attention. Cross-attention allows image patches (queries) to attend to relevant parts of the text prompt (keys/values) independently. Joint attention, on the other hand, treats image patch embeddings and text embeddings together, allowing them to mutually attend to each other within the same attention layer, fostering a more integrated understanding across modalities.
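A minimal sketch of the joint-attention idea follows: image and text tokens are concatenated into one sequence so every token attends to every other. For brevity this shares all projection weights across modalities, whereas double-stream MMDiTs keep separate per-modality weights and share only the attention operation itself; all dimensions are illustrative.

```python
# Joint attention: one self-attention layer over concatenated modalities.
import torch
import torch.nn as nn

d, heads = 768, 12
attn = nn.MultiheadAttention(d, heads, batch_first=True)

img_tokens = torch.randn(1, 256, d)   # patch embeddings
txt_tokens = torch.randn(1, 77, d)    # text embeddings (already projected to d)

seq = torch.cat([img_tokens, txt_tokens], dim=1)   # one joint sequence
out, _ = attn(seq, seq, seq)                       # every token attends to every other
img_out, txt_out = out.split([256, 77], dim=1)     # split back per modality
print(img_out.shape, txt_out.shape)
```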
Advancements in positional encoding and 2D representation
A crucial aspect for Transformer-based models, including DiTs and MMDiTs, is handling positional information. The original Transformer used absolute positional embeddings, which were added to the input embeddings. Rotary Position Embeddings (RoPE), introduced in 2021, offer a more advanced approach: each query and key vector is rotated within the attention mechanism by an angle proportional to its position, so that attention scores depend only on the relative position between tokens. This method has shown superior performance and interpretability. For 2D image data, RoPE has been adapted using strategies like axial RoPE (segregating the x and y axes) and mixed RoPE (interleaving x and y rotations). These methods aim to capture spatial relationships effectively. Additionally, techniques like centered coordinates (used in Seedream 2.0) and relative positional embeddings in cross-attention (e.g., diagonal encoding in Qwen Image) help models better understand spatial relationships, especially between different modalities like text and image tokens.
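A minimal 1D RoPE sketch matching the description above: pairs of dimensions in each query and key are rotated by an angle proportional to the token's position, which makes their dot products a function of relative offset only. Axial 2D RoPE would apply the same rotation separately to one half of the vector using x coordinates and to the other half using y coordinates.

```python
# Rotary position embedding (RoPE), 1D "rotate-half" variant.
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: [B, N, d] with d even. Returns position-rotated vectors, same shape."""
    B, N, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half) / half)         # per-pair frequency
    angles = torch.arange(N)[:, None] * freqs[None, :]   # angle grows with position
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # 2D rotation applied to each (x1, x2) pair of dimensions; after rotating
    # both queries and keys this way, q . k depends only on position offsets.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(1, 256, 64)
print(rope(q).shape)  # torch.Size([1, 256, 64])
```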
Scaling and future directions in image generation architectures
The lecture concludes by emphasizing that the field of image generation architectures is highly dynamic, with continuous innovation. The effectiveness of DiT and MMDiT models is strongly linked to their scale, both in terms of the number of parameters (transformer size) and the granularity of patch representation. Research indicates that scaling both dimensions synergistically yields the best results, as measured by metrics like FID. While specific architectures like U-Nets, DiTs, and MMDiTs provide foundational concepts, the exact implementation details and the combination of techniques are constantly evolving. Future architectures will likely continue to build upon self-attention, explore more sophisticated methods for multimodal fusion, and refine positional encoding strategies to achieve even higher fidelity and control in image generation. The ongoing development suggests a trend towards models that can handle increasingly complex prompts and generate highly realistic and diverse visual content.
Common Questions
How does the U-Net architecture work?
The U-Net architecture downsamples its input to capture global image structure and then upsamples it back to the original size, using skip connections to preserve local details, making it effective for tasks like image segmentation and generation.
Mentioned in this video
A model published in 2023 that scaled the U-Net architecture significantly.
The model released in 2024 that first coined the term MMDiT and relies on joint attention.
An example of a state-of-the-art double-stream MMDiT architecture for image generation, released in 2025.
A model that addresses how to make sense of positions when dealing with different numbers of patches, by centering coordinates to help the model locate different parts of the image relative to the center.
A vision generation equivalent of the ViT, introduced in 2022, relying on self-attention to allow all image patches to interact with one another for image generation.
An architecture coined in the Stable Diffusion 3 paper, relying on joint attention of different modalities for image generation, overcoming limitations of DiT.
An example of a single-stream MMDiT architecture, treating all modalities equally, released in 2025.
A hybrid MMDiT model that mixes both single-stream and double-stream layers, released in 2025.
Another strategy mentioned to combat blurriness in images generated by VAEs.
A metric used to quantify the quality of generated images, where a lower FID value indicates better image quality.
A method to inject conditions where image patch embeddings (queries) attend to text embeddings (keys and values) to determine relevance for image changes.
A common loss function used in diffusion models to predict the noise to be removed, expressed as an L2 regression loss.
A method for injecting external signals (time step, class label) into the Diffusion Transformer by modulating patch embeddings via learned gate, scale, and shift coefficients, found to be the most performant.
A leading method for representing positions in attention layers, introduced in a 2021 paper, which rotates queries and keys based on their positions.
A 2D position embedding strategy that mixes rotations with respect to both X and Y axes within the same vector, mitigating issues of lacking interaction between axes seen in Axial RoPE.
A strategy mentioned to combat blurriness in images generated by VAEs.
A 2D position embedding strategy that segregates axes, representing vectors according to their coordinates in X and Y parts of the vector, used in the original DiT paper.
An architecture for representing images in a meaningful latent space, structuring it with constraints on distribution behavior, but can lead to blurry images.
A method to condition the generation process, allowing models to place more emphasis on the input prompt by predicting the noise or velocity twice and combining the two predictions with a guidance factor (see the sketch after this list).
An image generation model shaped like a 'U' that downsamples the input to understand global structure and then upsamples it back to the original dimension while preserving local details through direct connections (skip connections).
Applied the encoder part of the Transformer architecture to image understanding by cutting images into patches and processing them like tokens with self-attention.
A method to inject conditions where both image patch embeddings and text embeddings are considered jointly and attended through the same self-attention layer.
An architecture that revolutionized the NLP field in 2017, based on the concept of self-attention mechanisms, which can also be applied to images.
A numerical method for solving differential equations, used in the iterative process of the DiT to follow the vector field and transition from an initial latent to a target latent.
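As a worked example of the classifier-free guidance entry above, here is a minimal sketch of the "predict twice" recipe; `model`, `cond`, and `null_cond` are hypothetical placeholders for a noise/velocity predictor and its conditional and unconditional inputs, and the default guidance factor is a common convention rather than a value from the lecture.

```python
# Classifier-free guidance: extrapolate from the unconditional prediction
# toward the conditional one by a guidance factor w.
import torch

def guided_prediction(model, x_t, t, cond, null_cond, w: float = 7.5):
    """Returns the guided noise/velocity prediction at time step t."""
    pred_cond = model(x_t, t, cond)         # conditioned on the prompt
    pred_uncond = model(x_t, t, null_cond)  # unconditional (e.g. empty prompt)
    # w = 0 -> unconditional; w = 1 -> conditional; w > 1 -> over-emphasize prompt
    return pred_uncond + w * (pred_cond - pred_uncond)
```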