Stanford CS25: Transformers United V6 I From Language Models to Native Multimodal Intelligence

Stanford Online
Education | 6 min read | 65 min video
Jun 4, 2026 | 2,775 views

TL;DR

Multimodal AI models are increasingly capable of understanding and generating content across text, images, and audio. However, unifying image generation and understanding within a single architecture remains a significant challenge.

Key Insights

1. Native multimodal language models convert diverse inputs (images, audio, video) into 'tokens' processed by transformer architectures, enabling unified training and prompting.

2. Chameleon models tokenize images into discrete representations using VQ-VAE, allowing interleaved text-image generation but showing a performance gap in image understanding compared to continuous methods.

3. Transfusion models combine autoregressive text modeling with diffusion-based image generation, achieving better image quality and token efficiency, but they still face challenges in unifying generation and understanding.

4. Mixture of Transformers (MoT) employs modality-specific parameters for different inputs (text, image, audio), significantly improving non-text modality generation without sacrificing text performance.

5. Despite advancements, current multimodal models excel primarily at digital information processing, with significant open problems remaining for real-world, physical intelligence, robotics control, and spatial-temporal understanding.

6. While understanding capabilities in multimodal models can enhance generation, training for generation (e.g., video generation) has shown little positive transfer to understanding. This contrasts with language models, where next-token prediction also yields reasoning ability.

Bridging the gap: From language models to multimodal intelligence

Large Language Models (LLMs) have revolutionized AI through next-token prediction on symbolic information, demonstrating emergent capabilities in knowledge acquisition, instruction following, and reasoning. However, the real world and digital environments are inherently multimodal, encompassing images, audio, and video alongside text. To build AI systems that interact seamlessly with this rich sensory input, the field is moving towards native multimodal language models. These models aim to process not only symbolic knowledge but also diverse sensory data by converting various modalities into a common 'token' representation, enabling them to be processed by transformer architectures in a unified manner. This approach allows for the transfer of architectural and training principles from LLMs to multimodal settings, facilitating capabilities like prompting, instruction following, and even reasoning and planning across different modalities.

Tokenizing the world: Representing multimodal data

A key philosophy behind many state-of-the-art multimodal models is the concept of tokenization across various modalities. For text, standard tokenization methods like Byte Pair Encoding (BPE) are used. For images, a 'patchification' process divides an image into small, fixed-size patches (e.g., 16x16 pixels). Each patch is then encoded into a vector representation, and the resulting sequence of vectors forms the 'image tokens'. Similarly, audio waveforms can be transformed and processed to generate audio tokens. Videos are treated as a sequence of image frames: each frame undergoes patchification and encoding, and the per-frame tokens are concatenated to represent the video as a temporal sequence of tokens. Not all 'tokens' are necessarily discrete; dense vector representations are also referred to as tokens. This universal tokenization allows multimodal data to be fed into transformer models, enabling them to learn from interleaved sequences of various data types.
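The patchification step can be made concrete with a small sketch. It assumes a toy grayscale image stored as nested lists and a fixed, hand-written projection matrix; the function names `patchify` and `embed_patch` and the `weights` values are hypothetical, and a real model would use learned weights on much larger RGB patches.

```python
def patchify(image, patch_size):
    """Split a 2D image (list of rows) into flattened patches, in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches


def embed_patch(patch, weights):
    """Linearly project one flattened patch to a vector (one continuous 'image token')."""
    return [sum(w * x for w, x in zip(row, patch)) for row in weights]


# Toy 4x4 image split into 2x2 patches (the text mentions 16x16 for real images).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, patch_size=2)

# Hypothetical fixed projection from 4 pixels to a 2-dimensional embedding;
# in a trained model these weights would be learned.
weights = [[1, 0, 0, 0], [0.25, 0.25, 0.25, 0.25]]
image_tokens = [embed_patch(p, weights) for p in patches]
```

A 4x4 image with 2x2 patches yields four tokens. A video would repeat this per frame and concatenate the per-frame token lists in temporal order.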

Two paths for multimodal output: Text-only vs. Omni-models

Multimodal models generally fall into two categories based on their output capabilities. The first type accepts multimodal input but generates only text output. Models like Gemini, Qwen, and Kimi often operate this way, excelling at understanding images, videos, or audio and answering questions or providing descriptions in text. While these companies may develop separate models for multimodal generation, their core products focus on text-only output for understanding tasks. The second category, termed 'Omni-models,' goes further by generating not only text but also other modalities like images and audio; GPT-4o is one example. This distinction highlights the varying ambitions in multimodal AI, from sophisticated input understanding to comprehensive cross-modal generation.

Chameleon: Discretizing images for unified generation

The Chameleon family of models explores the hypothesis of treating all modalities as discrete tokens. For images, this involves an extra step after patchification: vector embeddings of the patches are mapped to a learned codebook via VQ-VAE (Vector-Quantized Variational Autoencoder). This process converts image patches into discrete tokens, represented by their indices in the codebook. These image tokens are then interleaved with text tokens, and the entire sequence is trained using a standard cross-entropy objective, similar to language models. Chameleon demonstrated impressive capabilities in generating interleaved text and image sequences, enabling tasks like chatting, brainstorming, and image comparison. However, discretizing images can lead to significant information loss, resulting in a performance gap in image understanding compared to models using continuous image encodings (like SigLIP). Additionally, discrete generation can be less token-efficient, requiring more data to produce well-formed images, which suggests that discretizing all modalities might be too strong an assumption.
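The codebook lookup described above can be sketched as a nearest-neighbour search. This is a minimal illustration, not Chameleon's actual tokenizer: the four-entry `codebook`, the embeddings, and `TEXT_VOCAB_SIZE` are made-up values, and offsetting image indices past the text vocabulary is one assumed way to share a single vocabulary between text and image tokens.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(patch_embeddings, codebook):
    """Map each continuous embedding to the index of its nearest codebook entry."""
    return [min(range(len(codebook)),
                key=lambda i: squared_distance(embedding, codebook[i]))
            for embedding in patch_embeddings]


def dequantize(token_ids, codebook):
    """Recover the (lossy) embedding that each discrete token stands for."""
    return [codebook[i] for i in token_ids]


# Hypothetical toy codebook; a learned codebook would hold thousands of entries.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patch_embeddings = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.7]]

image_token_ids = quantize(patch_embeddings, codebook)
reconstructed = dequantize(image_token_ids, codebook)

# Assumed convention: image indices are shifted past the text vocabulary so that
# text and image tokens can be interleaved in one sequence.
TEXT_VOCAB_SIZE = 100
text_token_ids = [7, 42]
sequence = text_token_ids + [TEXT_VOCAB_SIZE + i for i in image_token_ids]
```

The reconstructed vectors differ from the original embeddings (for example, `[0.1, 0.2]` becomes `[0.0, 0.0]`), which is a small-scale picture of the information loss that discretization introduces.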

Transfusion: Unifying diffusion and autoregression

Transfusion addresses some limitations of discrete tokenization by adopting continuous image representations. It integrates diffusion models, known for high-quality image generation from noise, with autoregressive language modeling within a single transformer. The model takes interleaved text-image sequences as input: text is processed autoregressively, while image segments undergo diffusion-based generation. This involves starting from noise and iteratively refining it until a clear image is produced, after which the generated image can serve as input for subsequent steps. Transfusion demonstrates superior image generation quality and token efficiency compared to discrete token-based methods. However, it faces an open research problem: the continuous representations used for efficient image generation are not always ideal for image understanding tasks, creating a dilemma between optimized generation and understanding. Modern Omni-models often use separate encodings for these dual purposes.
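One way to picture training a single model on both objectives is a combined loss: cross-entropy on text positions plus a diffusion-style regression loss on image positions. The sketch below assumes the image loss is a mean squared error between predicted and true noise and that the two terms are summed with a weight; the function names and the `image_weight` parameter are hypothetical, and the inputs are toy values rather than model outputs.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of the target index under softmax(logits)."""
    peak = max(logits)
    log_normalizer = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_normalizer - logits[target]


def mean_squared_error(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def combined_loss(text_logits, text_targets, noise_pred, noise_true, image_weight=1.0):
    """Next-token loss on text positions plus a weighted noise-prediction loss on image patches."""
    text_loss = sum(cross_entropy(l, t)
                    for l, t in zip(text_logits, text_targets)) / len(text_targets)
    image_loss = sum(mean_squared_error(p, n)
                     for p, n in zip(noise_pred, noise_true)) / len(noise_true)
    return text_loss + image_weight * image_loss


# Toy batch: one text position with uniform logits, one image patch predicted perfectly.
loss = combined_loss(
    text_logits=[[0.0, 0.0]],
    text_targets=[0],
    noise_pred=[[0.5, -0.5]],
    noise_true=[[0.5, -0.5]],
)
```

With uniform logits over two classes and a perfect noise prediction, the total reduces to the text term alone, which equals ln 2.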

Mixture of Transformers (MoT): Modality-specific parameters

To improve efficiency and performance in multimodal processing, the Mixture of Transformers (MoT) architecture introduces modality-specific parameters within the transformer backbone. The intuition is that different modalities, like text and images, have distinct information densities and characteristics, so a unified set of parameters might not be optimal. MoT assigns independent sets of parameters (e.g., for QKV projections and feed-forward layers) to each modality. During processing, a deterministic routing mechanism activates the appropriate parameters based on the token's modality. While a joint attention mechanism allows for cross-modal interaction, the subsequent feed-forward layers are modality-specific. Experiments show that MoT significantly enhances the generation quality of non-text modalities like images and speech, without compromising text performance. This is attributed to reducing capacity competition within a single transformer, allowing specialized scaling for each modality. MoT can also be combined with Mixture of Experts (MoE) for further scaling, potentially with custom expert allocation per modality, and facilitates asynchronous training, enabling easier extension of existing text models with new modalities.
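The deterministic routing can be illustrated with a small sketch of the feed-forward stage only. It assumes each token carries a modality label and that parameters are looked up by that label; the joint attention step and the modality-specific QKV projections are omitted for brevity, and all names and weight values here are hypothetical.

```python
def linear(x, weight, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def feed_forward(x, params):
    """Two-layer feed-forward block with a ReLU in between."""
    hidden = relu(linear(x, params["w_in"], params["b_in"]))
    return linear(hidden, params["w_out"], params["b_out"])


def mot_feed_forward(tokens, modalities, params_by_modality):
    """Apply each token's own modality-specific block; routing is a plain lookup, not learned."""
    return [feed_forward(x, params_by_modality[m]) for x, m in zip(tokens, modalities)]


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]

# Hypothetical toy parameters: the text block doubles its input, the image block negates it.
params_by_modality = {
    "text": {"w_in": identity, "b_in": zeros,
             "w_out": [[2.0, 0.0], [0.0, 2.0]], "b_out": zeros},
    "image": {"w_in": identity, "b_in": zeros,
              "w_out": [[-1.0, 0.0], [0.0, -1.0]], "b_out": zeros},
}

tokens = [[1.0, 2.0], [1.0, 2.0]]
modalities = ["text", "image"]
outputs = mot_feed_forward(tokens, modalities, params_by_modality)
```

The two tokens have identical values but produce different outputs because each is processed by its own parameter set. Unlike Mixture of Experts, no router is trained: the modality label alone decides which parameters are used.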

The generation-understanding asymmetry: A puzzling phenomenon

A key area of ongoing research is the transferability of capabilities between multimodal understanding and generation. While strong understanding capabilities in a base model demonstrably improve generation quality (e.g., more detailed images, less hallucination in infographics), training models specifically for generation has shown little positive transfer back to understanding. This asymmetry is puzzling, especially when contrasted with language models, where next-token prediction surprisingly leads to robust reasoning and knowledge acquisition. Hypotheses for this gap include the fundamental differences between language (an abstraction of cognition) and sensory data like images/videos (passive observations), the more complex loss landscapes associated with visual data, and inherent redundancy in sequential visual data. This suggests that simply applying LLM principles to other modalities might not be sufficient, and fundamental challenges in multimodal representation and learning remain.

Future directions and remaining challenges

The field of native multimodal intelligence is rapidly evolving, with ongoing research in areas like object-oriented embeddings for visual elements (e.g., JEPA models) and unifying representations for perception, generation, and reasoning. While current models excel at digital information processing, significant challenges persist for real-world applications like robotics, spatial-temporal understanding, and physical intelligence. Multimodal models are computationally more demanding, requiring advanced infrastructure. Future work will likely focus on customizing models for specific capabilities and exploring how to unify these diverse functionalities into coherent systems. The effectiveness of language as a backbone for reasoning in multimodal tasks is clear, but whether pure vision/audio models can achieve similar reasoning depth remains an open question, as does the search for alternative training paradigms beyond pure next-token prediction.

Common Questions

What are native multimodal language models?

Native multimodal language models are AI systems designed to process and generate information from various modalities (text, images, audio, video) seamlessly, leveraging the transformer architecture and tokenization across modalities.