Key Moments
Stanford CS25: Transformers United V6 I From Language Models to Native Multimodal Intelligence
Multimodal AI models are increasingly capable of understanding and generating content across text, images, and audio. However, unifying image generation and understanding within a single architecture remains a significant challenge.
Key Insights
Native multimodal language models convert diverse inputs (images, audio, video) into 'tokens' processed by transformer architectures, enabling unified training and prompting.
Chameleon models tokenize images into discrete representations using a VQ-VAE, allowing interleaved text-image generation but showing a performance gap in image understanding compared to continuous methods.
Transfusion models combine autoregressive text modeling with diffusion-based image generation, achieving better image quality and token efficiency but still face challenges in unifying generation and understanding.
Mixture of Transformers (MoT) employs modality-specific parameters for different inputs (text, image, audio), significantly improving non-text modality generation without sacrificing text performance.
Despite advancements, current multimodal models excel primarily at digital information processing, with significant open problems remaining for real-world, physical intelligence, robotics control, and spatial-temporal understanding.
While understanding capabilities in multimodal models can enhance generation, training for generation (e.g., video generation) has shown little positive transfer to understanding. This contrasts with language models, where next-token prediction alone yields reasoning ability.
Bridging the gap: From language models to multimodal intelligence
Large Language Models (LLMs) have revolutionized AI through next-token prediction on symbolic information, demonstrating emergent capabilities in knowledge acquisition, instruction following, and reasoning. However, the real world and digital environments are inherently multimodal, encompassing images, audio, and video alongside text. To build AI systems that interact seamlessly with this rich sensory input, the field is moving towards native multimodal language models. These models aim to process not only symbolic knowledge but also diverse sensory data by converting various modalities into a common 'token' representation, enabling them to be processed by transformer architectures in a unified manner. This approach allows for the transfer of architectural and training principles from LLMs to multimodal settings, facilitating capabilities like prompting, instruction following, and even reasoning and planning across different modalities.
Tokenizing the world: Representing multimodal data
A key philosophy behind many state-of-the-art multimodal models is the concept of tokenization across various modalities. For text, standard tokenization methods like Byte Pair Encoding are used. For images, a 'patchification' process divides an image into small, fixed-size patches (e.g., 16x16 pixels). Each patch is then encoded into a vector representation, and these sequences of vectors form 'image tokens'. Similarly, audio waveforms can be transformed and processed to generate audio tokens. Videos are treated as a sequence of image frames, with each frame undergoing patchification and encoding, concatenating the resulting tokens to represent the video as a temporal sequence of tokens. Not all 'tokens' are necessarily discrete; dense vector representations are also referred to as tokens. This universal tokenization allows multimodal data to be fed into transformer models, enabling them to learn from interleaved sequences of various data types.
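The patchification step described above can be sketched as follows. This is a minimal illustration, assuming a grayscale image stored as a list of rows; the function names `patchify` and `video_to_tokens` are hypothetical, and the learned encoder that would turn each flattened patch into an embedding vector is left out.

```python
def patchify(image, patch_size):
    """Split a 2D grayscale image (list of rows) into flattened patches.

    Returns one flat list per patch, ordered left-to-right, top-to-bottom.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


def video_to_tokens(frames, patch_size):
    """Concatenate per-frame patch sequences into one temporal sequence."""
    tokens = []
    for frame in frames:
        tokens.extend(patchify(frame, patch_size))
    return tokens


# Toy 4x4 "image" with 2x2 patches (the text's example uses 16x16 pixels).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
image_tokens = patchify(image, patch_size=2)
video_tokens = video_to_tokens([image, image], patch_size=2)
```

Here the 4x4 image yields four patches, and a two-frame "video" yields eight, which is why video inputs grow the sequence length so quickly.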
Two paths for multimodal output: Text-only vs. Omni-models
Multimodal models generally fall into two categories based on their output capabilities. The first type accepts multimodal input but generates only text output. Models like Gemini, Qwen, and Kimi often operate this way, excelling at understanding images, videos, or audio and answering questions or providing descriptions in text. While these companies may develop separate models for multimodal generation, their core products focus on text-only output for understanding tasks. The second category, termed 'Omni-models,' goes further by generating not only text but also other modalities like images and audio; GPT-4o is one example. This distinction highlights the varying ambitions in multimodal AI, from sophisticated input understanding to comprehensive cross-modal generation.
Chameleon: Discretizing images for unified generation
The Chameleon family of models explores the hypothesis of treating all modalities as discrete tokens. For images, this involves an extra step after patchification: vector embeddings of the patches are mapped to a learned codebook via a VQ-VAE (Vector Quantized Variational Autoencoder). This process converts image patches into discrete tokens, represented by their indices in the codebook. These image tokens are then interleaved with text tokens, and the entire sequence is trained using a standard cross-entropy objective, similar to language models. Chameleon demonstrated impressive capabilities in generating interleaved text and image sequences, enabling tasks like chatting, brainstorming, and image comparison. However, discretizing images can lead to significant information loss, resulting in a performance gap in image understanding compared to models using continuous image encodings (such as SigLIP). Additionally, discrete generation can be less token-efficient, requiring more data to produce well-formed images, suggesting the goal of discretizing all modalities might be too strong an assumption.
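The codebook lookup can be illustrated with a small sketch. It assumes a nearest-neighbour match under squared Euclidean distance; the codebook values, the `TEXT_VOCAB_SIZE` offset, and the token ids are all made up for illustration and are not taken from Chameleon itself.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(embedding, codebook):
    """Return the index of the codebook entry nearest to the embedding."""
    return min(
        range(len(codebook)),
        key=lambda i: squared_distance(embedding, codebook[i]),
    )


# Hypothetical 2-D codebook with four entries; real codebooks are learned
# and much larger.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patch_embeddings = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.1]]
indices = [quantize(e, codebook) for e in patch_embeddings]

# Assumption: image ids are placed after the text vocabulary so both share
# one id space.
TEXT_VOCAB_SIZE = 100
image_ids = [TEXT_VOCAB_SIZE + i for i in indices]

text_ids = [5, 17, 42]  # made-up text token ids
sequence = text_ids + image_ids  # a single discrete sequence

# Information loss: the codebook entry only approximates the original vector.
reconstruction_error = sum(
    squared_distance(e, codebook[i]) for e, i in zip(patch_embeddings, indices)
)
```

The non-zero `reconstruction_error` is a toy version of the information loss mentioned above: once a patch is replaced by its codebook index, the original embedding can only be approximately recovered.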
Transfusion: Unifying diffusion and autoregression
Transfusion addresses some limitations of discrete tokenization by adopting continuous image representations. It integrates diffusion models, known for high-quality image generation from noise, with autoregressive language modeling within a single transformer. The model takes interleaved text-image sequences as input: text is processed autoregressively, while image segments undergo diffusion-based generation. This involves starting from noise and iteratively refining it until a clear image is produced, after which the generated image can serve as input for subsequent steps. Transfusion demonstrates superior image generation quality and token efficiency compared to discrete token-based methods. However, it faces an open research problem: the continuous representations used for efficient image generation are not always ideal for image understanding tasks, creating a dilemma between optimized generation and understanding. Modern Omni-models often use separate encodings for these dual purposes.
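One way to picture a single model trained on both objectives is a loss that applies cross-entropy to text positions and a noise-prediction error to image positions. The sketch below is an assumption about how such objectives are commonly combined, not a transcription of Transfusion's training code; `combined_loss` and `image_weight` are hypothetical names.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target token under predicted probabilities."""
    return -math.log(probs[target])


def mse(predicted, actual):
    """Mean squared error between predicted and actual noise vectors."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def combined_loss(text_steps, image_steps, image_weight=1.0):
    """Average text loss plus a weighted average image loss.

    text_steps:  list of (predicted_probs, target_token_id) pairs
    image_steps: list of (predicted_noise, true_noise) pairs
    image_weight: hypothetical coefficient balancing the two terms
    """
    text_loss = sum(cross_entropy(p, t) for p, t in text_steps) / len(text_steps)
    image_loss = sum(mse(p, a) for p, a in image_steps) / len(image_steps)
    return text_loss + image_weight * image_loss


# Toy predictions: two text positions and one image patch.
text_steps = [([0.5, 0.25, 0.25], 0), ([0.25, 0.5, 0.25], 1)]
image_steps = [([0.0, 1.0], [0.0, 0.0])]
loss = combined_loss(text_steps, image_steps, image_weight=2.0)
```

Because both terms flow into one scalar, a single set of transformer weights receives gradients from text and image positions alike, which is the sense in which the two modeling styles are unified.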
Mixture of Transformers (MoT): Modality-specific parameters
To improve efficiency and performance in multimodal processing, the Mixture of Transformers (MoT) architecture introduces modality-specific parameters within the transformer backbone. The intuition is that different modalities, like text and images, have distinct information densities and characteristics, so a unified set of parameters might not be optimal. MoT assigns independent sets of parameters (e.g., for QKV projections and feed-forward layers) to each modality. During processing, a deterministic routing mechanism activates the appropriate parameters based on the token's modality. While a joint attention mechanism allows for cross-modal interaction, the subsequent feed-forward layers are modality-specific. Experiments show that MoT significantly enhances the generation quality of non-text modalities like images and speech, without compromising text performance. This is attributed to reducing capacity competition within a single transformer, allowing specialized scaling for each modality. MoT can also be combined with Mixture of Experts (MoE) for further scaling, potentially with custom expert allocation per modality, and facilitates asynchronous training, enabling easier extension of existing text models with new modalities.
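The deterministic routing can be sketched as a lookup on each token's modality tag. This is a toy illustration with hypothetical 2x2 weights standing in for the modality-specific feed-forward layers; the joint attention over the full sequence is omitted.

```python
def linear(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


# Hypothetical per-modality feed-forward weights, kept 2x2 for readability.
FFN_WEIGHTS = {
    "text": [[1.0, 0.0], [0.0, 1.0]],   # identity
    "image": [[2.0, 0.0], [0.0, 2.0]],  # doubles each feature
    "audio": [[0.0, 1.0], [1.0, 0.0]],  # swaps the two features
}


def modality_ffn(tokens):
    """Route each (modality, vector) token to its own modality's weights.

    Routing is a plain dictionary lookup on the modality tag, so it is
    deterministic and involves no learned router.
    """
    return [
        (modality, linear(FFN_WEIGHTS[modality], vector))
        for modality, vector in tokens
    ]


# The same input vector produces different outputs depending on its modality.
sequence = [("text", [1.0, 2.0]), ("image", [1.0, 2.0]), ("audio", [1.0, 2.0])]
outputs = modality_ffn(sequence)
```

Since each token only ever touches its own modality's weights, adding a new modality in this sketch means adding one more entry to `FFN_WEIGHTS` without changing the existing ones, which mirrors the point about extending text models with new modalities.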
The generation-understanding asymmetry: A puzzling phenomenon
A key area of ongoing research is the transferability of capabilities between multimodal understanding and generation. While strong understanding capabilities in a base model demonstrably improve generation quality (e.g., more detailed images, less hallucination in infographics), training models specifically for generation has shown little positive transfer back to understanding. This asymmetry is puzzling, especially when contrasted with language models, where next-token prediction surprisingly leads to robust reasoning and knowledge acquisition. Hypotheses for this gap include the fundamental differences between language (an abstraction of cognition) and sensory data like images/videos (passive observations), the more complex loss landscapes associated with visual data, and inherent redundancy in sequential visual data. This suggests that simply applying LLM principles to other modalities might not be sufficient, and fundamental challenges in multimodal representation and learning remain.
Future directions and remaining challenges
The field of native multimodal intelligence is rapidly evolving, with ongoing research in areas like object-oriented embeddings for visual elements (e.g., JEPA models) and unifying representations for perception, generation, and reasoning. While current models excel at digital information processing, significant challenges persist for real-world applications like robotics, spatial-temporal understanding, and physical intelligence. Multimodal models are computationally more demanding, requiring advanced infrastructure. Future work will likely focus on customizing models for specific capabilities and exploring how to unify these diverse functionalities into coherent systems. The effectiveness of language as a backbone for reasoning in multimodal tasks is clear, but whether pure vision/audio models can achieve similar reasoning depth remains an open question, alongside exploring alternative training paradigms beyond pure next-token prediction.
Common Questions
Native multimodal language models are AI systems designed to process and generate information from various modalities (text, images, audio, video) seamlessly, leveraging the transformer architecture and tokenization across modalities.
Topics
Mentioned in this video
Discussed as a major breakthrough in recent years, with widespread daily use for tasks like answering questions and coding. Built on next token prediction.
The core architecture underlying modern large language models, responsible for next token prediction.
Information coming from different modalities such as images, audio, and video, which AI systems need to handle for real-world interaction.
An architectural technique that can be applied to multimodal language models for better scaling and performance.
A concept discussed in the Q&A as a potential architecture for real-world interaction and multimodal understanding, aiming for abstraction similar to language.
A successful approach for image generation that works by iteratively removing noise to produce a clear image, integrated into the Transfusion model.
Mentioned as an example of a large language model used for everyday tasks like asking questions.
Mentioned as a multimodal language model with multimodal input and text-only output.
An example of an 'omni model' capable of generating not only text but also images.
A state-of-the-art multimodal language model that uses continuous image encoding, contrasted with Chameleon's discrete approach.
A multimodal model architecture that combines autoregressive language modeling with diffusion-based image generation.
An architecture that uses independent sets of transformer parameters for each modality (text, image, audio) to improve efficiency and generation quality.
An omni model similar to MoT, with separate parameters for image generation and a multimodal backbone for understanding.
Mentioned as a tool used for coding projects, often in conjunction with large language models.
A multimodal language model mentioned as an example of models with multimodal input and text-only output.
Mentioned as a multimodal language model with multimodal input and text-only output.