What is the difference between models with multimodal input and omni models?

Models with multimodal input primarily process various data types but only output text. Omni models, on the other hand, can both process multimodal input and generate output in multiple modalities, including images and audio.

How does the Chameleon model discretize images?

The Chameleon model discretizes images by dividing them into patches, encoding these patches into embeddings, and then using a learned vector codebook to find the closest discrete representation for each patch.

What is the main advantage of the Transfusion model?

The Transfusion model seamlessly unifies autoregressive language modeling with diffusion-based image generation, demonstrating significantly better image generation quality and token efficiency compared to discrete token-based approaches.

How does the Mixture of Transformers (MoT) architecture differ from dense models?

MoT uses independent sets of transformer parameters for each modality (e.g., text, image) and deterministic routing, allowing for specialized processing and improved generation quality for non-text modalities without sacrificing text performance.

Can modality-specific parameters improve multimodal model training stability?

Yes, MoT's modality-specific parameters can improve training stability by allowing for asynchronous training of different modalities and making it easier to fine-tune existing models with new modalities.

Does training for image generation improve understanding capabilities?

Currently, there is little evidence that training models for image generation significantly improves their understanding capabilities. However, better understanding capabilities strongly enhance generation quality.

Why do language models seem better at reasoning than video models?

One hypothesis is that language is a compressed abstraction of human cognition, encoding reasoning processes. Images and videos are sensory data, and their loss landscapes and data redundancy might make reasoning harder to learn directly.

What are the limitations of current omni models?

Current omni models excel at digital information processing but are far from bridging the gap to powerful physical world multimodal intelligence. They are also computationally heavy, posing infrastructure challenges.

How can information be transferred between modalities in a Mixture of Transformers architecture?

Information transfer occurs through the self-attention mechanism, which allows different modality tokens to interact after being projected by their specific parameters, and via the autoregressive causal conditioning structure.

Is it feasible to unify different modalities by representing text as images?

While an interesting idea, representing text as images is likely less efficient than using native text tokens: pixels capture symbolic structure poorly, and rendering or OCR adds overhead. It is still worth experimenting with.

Can a single representation capture enough for perception, generation, and reasoning?

Research is ongoing, with promising results in unifying representations for image generation and understanding. If successful, this would bring multimodal modeling closer to language modeling's unified input/output capabilities.

Key Moments

Stanford CS25: Transformers United V6 I From Language Models to Native Multimodal Intelligence

Stanford Online

Education | 6 min read | 65 min video

Jun 4, 2026 | 2,775 views

Stanford, Stanford Online, AI, Artificial Intelligence, Transformers

TL;DR

Multimodal AI models are increasingly capable of understanding and generating content across text, images, and audio. However, unifying image generation and understanding within a single architecture remains a significant challenge.

Key Insights

Native multimodal language models convert diverse inputs (images, audio, video) into 'tokens' processed by transformer architectures, enabling unified training and prompting.

Chameleon models tokenize images into discrete representations using VQ-VAE, allowing interleaved text-image generation but showing a performance gap in image understanding compared to continuous methods.

Transfusion models combine autoregressive text modeling with diffusion-based image generation, achieving better image quality and token efficiency but still face challenges in unifying generation and understanding.

Mixture of Transformers (MoT) employs modality-specific parameters for different inputs (text, image, audio), significantly improving non-text modality generation without sacrificing text performance.

Despite advancements, current multimodal models excel primarily at digital information processing, with significant open problems remaining for real-world, physical intelligence, robotics control, and spatial-temporal understanding.

While understanding capabilities in multimodal models can enhance generation, training for generation (e.g., video generation) has shown little positive transfer to improving understanding capabilities, unlike language models which inherently capture reasoning.

Bridging the gap: From language models to multimodal intelligence

Large Language Models (LLMs) have revolutionized AI through next-token prediction on symbolic information, demonstrating emergent capabilities in knowledge acquisition, instruction following, and reasoning. However, the real world and digital environments are inherently multimodal, encompassing images, audio, and video alongside text. To build AI systems that interact seamlessly with this rich sensory input, the field is moving towards native multimodal language models. These models aim to process not only symbolic knowledge but also diverse sensory data by converting various modalities into a common 'token' representation, enabling them to be processed by transformer architectures in a unified manner. This approach allows for the transfer of architectural and training principles from LLMs to multimodal settings, facilitating capabilities like prompting, instruction following, and even reasoning and planning across different modalities.

Tokenizing the world: Representing multimodal data

A key philosophy behind many state-of-the-art multimodal models is the concept of tokenization across various modalities. For text, standard tokenization methods like Byte Pair Encoding are used. For images, a 'patchification' process divides an image into small, fixed-size patches (e.g., 16x16 pixels). Each patch is then encoded into a vector representation, and these sequences of vectors form 'image tokens'. Similarly, audio waveforms can be transformed and processed to generate audio tokens. Videos are treated as a sequence of image frames, with each frame undergoing patchification and encoding, concatenating the resulting tokens to represent the video as a temporal sequence of tokens. Not all 'tokens' are necessarily discrete; dense vector representations are also referred to as tokens. This universal tokenization allows multimodal data to be fed into transformer models, enabling them to learn from interleaved sequences of various data types.
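The patchification step described above can be sketched in a few lines. This is a minimal illustration using only the standard library and a toy grayscale image stored as nested lists; the function names `patchify` and `video_token_count` are hypothetical, and a real model would additionally pass each flattened patch through a learned encoder, which is omitted here.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel values


def patchify(image: Image, patch_size: int) -> List[List[float]]:
    """Split an H x W image into flattened, non-overlapping square patches.

    Patches are returned in row-major order (left to right, top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


def video_token_count(num_frames: int, height: int, width: int, patch_size: int) -> int:
    """Patch tokens produced when every frame is patchified and concatenated."""
    return num_frames * (height // patch_size) * (width // patch_size)


# A toy 4x4 image with 2x2 patches yields 4 patches of 4 pixels each.
toy_image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
toy_patches = patchify(toy_image, patch_size=2)
```

With 16x16 patches, a 224x224 frame gives 14 x 14 = 196 patches, so `video_token_count(8, 224, 224, 16)` is 1568 for an 8-frame clip, which shows how quickly video sequences grow compared with text.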

Two paths for multimodal output: Text-only vs. Omni-models

Multimodal models generally fall into two categories based on their output capabilities. The first type accepts multimodal input but generates only text output. Models like Gemini, Qwen, and Kimi often operate this way, excelling at understanding images, videos, or audio and answering questions or providing descriptions in text. While these companies may develop separate models for multimodal generation, their core products focus on text-only output for understanding tasks. The second category, termed 'Omni-models,' goes further by generating not only text but also other modalities like images and audio; GPT-4o is one example. This distinction is crucial as it highlights the varying ambitions in multimodal AI, from sophisticated input understanding to comprehensive cross-modal generation.

Chameleon: Discretizing images for unified generation

The Chameleon family of models explores the hypothesis of treating all modalities as discrete tokens. For images, this involves an extra step after patchification: vector embeddings of the patches are mapped to a learned codebook via VQ-VAE (Vector Quantized Variational Autoencoder). This process converts image patches into discrete tokens, represented by their indices in the codebook. These image tokens are then interleaved with text tokens, and the entire sequence is trained using a standard cross-entropy objective, similar to language models. Chameleon demonstrated impressive capabilities in generating interleaved text and image sequences, enabling tasks like chatting, brainstorming, and image comparison. However, discretizing images can lead to significant information loss, resulting in a performance gap in image understanding compared to models using continuous image encodings (like SigLIP). Additionally, discrete generation can be less token-efficient, requiring more data to produce well-formed images, suggesting the goal of discretizing all modalities might be too strong an assumption.
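The codebook lookup at the heart of this discretization can be illustrated with a nearest-neighbor search. The sketch below is not Chameleon's actual tokenizer: the two-dimensional embeddings, the four-entry `toy_codebook`, and the function names are all assumptions chosen to keep the example small. It also measures the reconstruction error, which is one way to see the information loss mentioned above.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(embeddings: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each patch embedding to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(e, codebook[k]))
        for e in embeddings
    ]


def dequantize(indices: List[int], codebook: List[Vector]) -> List[List[float]]:
    """Recover the (lossy) embeddings by looking the indices up in the codebook."""
    return [list(codebook[i]) for i in indices]


# Hypothetical 4-entry codebook and three patch embeddings.
toy_codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
toy_embeddings = [(0.1, 0.2), (0.9, 0.8), (0.7, 0.1)]

image_token_ids = quantize(toy_embeddings, toy_codebook)
reconstructed = dequantize(image_token_ids, toy_codebook)

# Total squared error introduced by snapping embeddings to codebook entries.
quantization_error = sum(
    squared_distance(e, r) for e, r in zip(toy_embeddings, reconstructed)
)
```

The integer indices in `image_token_ids` play the same role as text token IDs, which is what allows a single cross-entropy objective over the interleaved sequence. The nonzero `quantization_error` is the price of that uniformity.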

Transfusion: Unifying diffusion and autoregression

Transfusion addresses some limitations of discrete tokenization by adopting continuous image representations. It integrates diffusion models, known for high-quality image generation from noise, with autoregressive language modeling within a single transformer. The model takes interleaved text-image sequences as input: text is processed autoregressively, while image segments undergo diffusion-based generation. This involves starting from noise and iteratively refining it until a clear image is produced, after which the generated image can serve as input for subsequent steps. Transfusion demonstrates superior image generation quality and token efficiency compared to discrete token-based methods. However, it faces an open research problem: the continuous representations used for efficient image generation are not always ideal for image understanding tasks, creating a dilemma between optimized generation and understanding. Modern Omni-models often use separate encodings for these dual purposes.
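One way to picture how a single model can be trained on both objectives is a combined loss: cross-entropy on text positions plus a denoising loss on image positions. The sketch below rests on several assumptions not stated in this summary: the image loss is taken to be mean squared error on predicted noise, the two terms are summed with a hypothetical `image_loss_weight`, and `add_noise` uses a simple variance-preserving blend. It illustrates the idea only and is not the Transfusion training code.

```python
import math
from typing import List, Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-probability of the target token under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def mean_squared_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Average squared difference between two equal-length vectors."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def add_noise(clean: Sequence[float], noise: Sequence[float], signal_level: float) -> List[float]:
    """Blend a clean patch with noise; signal_level in [0, 1] is the kept signal variance."""
    keep = math.sqrt(signal_level)
    corrupt = math.sqrt(1.0 - signal_level)
    return [keep * x + corrupt * n for x, n in zip(clean, noise)]


def mixed_loss(
    text_logits: List[Sequence[float]],
    text_targets: List[int],
    predicted_noise: List[Sequence[float]],
    true_noise: List[Sequence[float]],
    image_loss_weight: float,
) -> float:
    """Language-modeling loss on text positions plus weighted denoising loss on image patches."""
    text_loss = sum(
        cross_entropy(logits, target)
        for logits, target in zip(text_logits, text_targets)
    ) / len(text_targets)
    image_loss = sum(
        mean_squared_error(pred, true)
        for pred, true in zip(predicted_noise, true_noise)
    ) / len(true_noise)
    return text_loss + image_loss_weight * image_loss
```

At generation time the image branch would run the reverse process: start from pure noise and repeatedly subtract the predicted noise, which is the iterative refinement described above.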

Mixture of Transformers (MoT): Modality-specific parameters

To improve efficiency and performance in multimodal processing, the Mixture of Transformers (MoT) architecture introduces modality-specific parameters within the transformer backbone. The intuition is that different modalities, like text and images, have distinct information densities and characteristics, so a unified set of parameters might not be optimal. MoT assigns independent sets of parameters (e.g., for QKV projections and feed-forward layers) to each modality. During processing, a deterministic routing mechanism activates the appropriate parameters based on the token's modality. While a joint attention mechanism allows for cross-modal interaction, the subsequent feed-forward layers are modality-specific. Experiments show that MoT significantly enhances the generation quality of non-text modalities like images and speech, without compromising text performance. This is attributed to reducing capacity competition within a single transformer, allowing specialized scaling for each modality. MoT can also be combined with Mixture of Experts (MoE) for further scaling, potentially with custom expert allocation per modality, and facilitates asynchronous training, enabling easier extension of existing text models with new modalities.
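The routing idea can be made concrete with a heavily simplified single layer. In the sketch below, each token carries a modality label that deterministically selects its projection and feed-forward weights, while attention runs jointly over the whole sequence. Everything here is an illustrative assumption: the single attention head, the causal mask, the one-matrix feed-forward with ReLU, the absence of normalization, and the names `mot_layer` and `demo_params`.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Sequence[float]) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores: Sequence[float]) -> Vector:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mot_layer(
    tokens: List[Vector],
    modalities: List[str],
    params: Dict[str, Dict[str, Matrix]],
) -> List[Vector]:
    """One simplified layer: modality-specific weights, joint causal attention."""
    # Deterministic routing: the modality label picks the parameter set.
    queries = [matvec(params[m]["wq"], x) for x, m in zip(tokens, modalities)]
    keys = [matvec(params[m]["wk"], x) for x, m in zip(tokens, modalities)]
    values = [matvec(params[m]["wv"], x) for x, m in zip(tokens, modalities)]
    dim = len(queries[0])
    outputs = []
    for i, (x, m) in enumerate(zip(tokens, modalities)):
        # Joint attention: position i sees all earlier tokens of any modality.
        scores = [
            sum(q * k for q, k in zip(queries[i], keys[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        attended = [
            sum(w * values[j][d] for j, w in enumerate(weights)) for d in range(dim)
        ]
        hidden = [a + b for a, b in zip(x, attended)]  # residual connection
        # Modality-specific feed-forward (one linear map plus ReLU for brevity).
        ffn_out = [max(0.0, h) for h in matvec(params[m]["w_ffn"], hidden)]
        outputs.append([h + f for h, f in zip(hidden, ffn_out)])
    return outputs


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
demo_params = {
    "text": {"wq": identity, "wk": identity, "wv": identity, "w_ffn": zeros},
    "image": {"wq": identity, "wk": identity, "wv": identity, "w_ffn": identity},
}
demo_outputs = mot_layer(
    [[1.0, 0.0], [0.0, 1.0]], ["text", "image"], demo_params
)
```

In `demo_outputs`, the image token's first coordinate is nonzero only because it attended to the preceding text token, which is the cross-modal interaction that joint attention provides. Relabeling that token as text would send it through the text feed-forward weights and change its output, which is the effect of deterministic routing.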

The generation-understanding asymmetry: A puzzling phenomenon

A key area of ongoing research is the transferability of capabilities between multimodal understanding and generation. While strong understanding capabilities in a base model demonstrably improve generation quality (e.g., more detailed images, less hallucination in infographics), training models specifically for generation has shown little positive transfer back to understanding. This asymmetry is puzzling, especially when contrasted with language models, where next-token prediction surprisingly leads to robust reasoning and knowledge acquisition. Hypotheses for this gap include the fundamental differences between language (an abstraction of cognition) and sensory data like images/videos (passive observations), the more complex loss landscapes associated with visual data, and inherent redundancy in sequential visual data. This suggests that simply applying LLM principles to other modalities might not be sufficient, and fundamental challenges in multimodal representation and learning remain.

Future directions and remaining challenges

The field of native multimodal intelligence is rapidly evolving, with ongoing research in areas like object-oriented embeddings for visual elements (e.g., JEPA models) and unifying representations for perception, generation, and reasoning. While current models excel at digital information processing, significant challenges persist for real-world applications like robotics, spatial-temporal understanding, and physical intelligence. Multimodal models are computationally more demanding, requiring advanced infrastructure. Future work will likely focus on customizing models for specific capabilities and exploring how to unify these diverse functionalities into coherent systems. The effectiveness of language as a backbone for reasoning in multimodal tasks is clear, but whether pure vision/audio models can achieve similar reasoning depth remains an open question, alongside exploring alternative training paradigms beyond pure next-token prediction.

Common Questions

What are native multimodal language models?

Native multimodal language models are AI systems designed to process and generate information from various modalities (text, images, audio, video) seamlessly, leveraging the transformer architecture and tokenization across modalities.

Topics

Neuroscience & the Brain, AI & Machine Learning, Technology & Innovation, Science & Mathematics, Language Models, Deep Learning, Image Generation, Multimodal AI, Audio Processing, AI Research, Transformer Architecture, Visual Understanding

Mentioned in this video

Organizations

University of Washington

Institution where Victoria Lin received her PhD.

Meta AI

Victoria Lin's previous employer where she was a research scientist.

Salesforce AI Research

Victoria Lin's previous employer where she was a research scientist.

People

Victoria Lin

Member of technical staff at Thinking Machines Lab, focusing on native multimodal intelligence. Previously a research scientist at Meta AI and Salesforce AI Research.

Companies

Thinking Machines Lab

Victoria Lin's current affiliation where she works on native multimodal intelligence.

Concepts

Large Language Models

Discussed as a major breakthrough in recent years, with widespread daily use for tasks like answering questions and coding. Built on next token prediction.

Transformer

The core architecture underlying modern large language models, responsible for next token prediction.

Multimodal information

Information coming from different modalities such as images, audio, and video, which AI systems need to handle for real-world interaction.

Mixture-of-Experts

An architectural technique that can be applied to multimodal language models for better scaling and performance.

JEPA world models

A concept discussed in the Q&A as a potential architecture for real-world interaction and multimodal understanding, aiming for abstraction similar to language.

diffusion models

A successful approach for image generation that works by iteratively removing noise to produce a clear image, integrated into the Transfusion model.

Software & Apps

ChatGPT

Mentioned as an example of a large language model used for everyday tasks like asking questions.

Qwen

Mentioned as a multimodal language model with multimodal input and text-only output.

GPT-4o

An example of an 'omni model' capable of generating not only text but also images.

SigLIP
 
A vision-language image encoder that produces continuous image encodings, contrasted with Chameleon's discrete approach.

Transfusion

A multimodal model architecture that combines autoregressive language modeling with diffusion-based image generation.

Mixture of Transformers

An architecture that uses independent sets of transformer parameters for each modality (text, image, audio) to improve efficiency and generation quality.

BAGEL

An omni model similar to MoT, with separate parameters for image generation and a multimodal backbone for understanding.

Codex

Mentioned as a tool used for coding projects, often in conjunction with large language models.

Gemini

A multimodal language model mentioned as an example of models with multimodal input and text-only output.

Kimi

Mentioned as a multimodal language model with multimodal input and text-only output.

Products

Chameleon

A family of multimodal models built on the hypothesis that every modality can be discretized into tokens, including images via VQ-VAE.
