Key Moments
Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 17: Alignment - Multimodality
Multimodal AI models can now process and generate combinations of text, images, and video by treating all data as 'tokens', but this discretization sacrifices fine-grained detail crucial for tasks like OCR.
Key Insights
The ultimate goal in multimodal AI is an 'omnimodel' capable of processing and generating any combination of text, images, audio, and video.
CLIP, a foundational multimodal model, uses contrastive language-image pre-training to align image and text embeddings. In zero-shot evaluation on ImageNet, it outperformed a ResNet trained on the 1.2 million labeled ImageNet images.
SigLIP improved upon CLIP by using a sigmoid loss that treats each image-text pair as a binary classification, enabling more efficient training and reducing reliance on large batch sizes.
LLaVA, a vision-language model (VLM), stitches together off-the-shelf image encoders (like CLIP) and language models, using a two-stage training process of alignment followed by fine-tuning. Later versions (LLaVA-1.5, LLaVA-NeXT) added support for multiple images and video, along with 'any-resolution' (any-res) processing.
Qwen2-VL introduced dynamic resolution and multimodal rotary positional embeddings (M-RoPE) for handling varying image sizes and temporal data in videos, supporting video sequences of up to roughly 16,000 tokens.
Chameleon offers a novel approach by mapping all modalities, including images, into discrete tokens using Vector Quantized Variational Autoencoders (VQ-VAEs), enabling end-to-end language model training but facing challenges with training stability and information loss due to discretization.
The quest for an omnimodel and tokenization challenges
The lecture begins by introducing the concept of 'omnimodels' – AI systems capable of processing and generating any combination of modalities: text, images, audio, and video. This contrasts with current language models, which are primarily text-based. The core challenge in building such models lies in converting diverse data types into a format that transformers, which operate on 'tokens,' can process. While text tokenization is relatively straightforward (e.g., using BPE), non-text modalities like images and audio require more complex methods to be represented as discrete or continuous tokens that capture meaningful semantic units, rather than raw pixels.
CLIP: Aligning images and text through contrastive learning
The CLIP (Contrastive Language-Image Pre-training) model, developed by OpenAI, is presented as a foundational step in multimodal AI. Inspired by the success of large-scale language models trained on internet text, CLIP leverages vast amounts of image-text pairs scraped from the web. Its objective is to learn embeddings where corresponding image and text pairs are close in the embedding space, while non-corresponding pairs are distant. CLIP uses a vision transformer (ViT) as its image encoder and a GPT-2 style transformer for text. The training involves a contrastive loss on an N x N matrix of image-text similarity scores. A key result highlighted is CLIP's zero-shot performance on ImageNet, outperforming a ResNet trained on 1.2 million labeled images, demonstrating the power of learning from naturally occurring web data.
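The contrastive objective over the N x N similarity matrix can be sketched in a few lines. This is a minimal illustration in plain Python, not the reference implementation: the function names are hypothetical, the embeddings are toy lists rather than encoder outputs, and the temperature of 0.07 is an assumed fixed value (in practice it is typically a learned parameter).

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Softmax cross-entropy for one row of logits and an integer target index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i of images and texts is the positive."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] is the scaled cosine similarity of image i and text j.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text: each row should select its own column (the diagonal).
    i2t = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Text-to-image: the same classification on the transposed matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = sum(cross_entropy_row(columns[j], j) for j in range(n)) / n
    return (i2t + t2i) / 2
```

Because every other item in the batch acts as a negative, the softmax normalizes over the whole row and column, which is one reason the objective benefits from large batches.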
SigLIP: Enhancing efficiency with a simpler loss
SigLIP (Sigmoid Loss for Language Image Pre-training) is introduced as an improvement over CLIP, addressing its reliance on large batch sizes and on a softmax-based multiclass classification loss. SigLIP simplifies the objective to a binary classification task for each image-text pair: are they aligned or not? This is achieved using a sigmoid loss, which is more computationally efficient and less sensitive to batch size. The paper also reports significant training speedups, with training time going from 10 days on 256 TPU v3s for CLIP to 5 days on 32 TPU v4s for SigLIP, showcasing advances in distributed training strategies for multimodal models.
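The pairwise binary objective can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the inputs are assumed to be already unit-normalized embeddings, and the scale of 10 and bias of -10 are assumed fixed starting values (in practice both are typically learned).

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid_pair_loss(image_embs, text_embs, scale=10.0, bias=-10.0):
    """Binary loss over all N x N pairs; inputs are assumed unit-normalized."""
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sim = sum(a * b for a, b in zip(image_embs[i], text_embs[j]))
            # +1 for the matching pair on the diagonal, -1 for every other pair.
            label = 1.0 if i == j else -1.0
            total -= log_sigmoid(label * (scale * sim + bias))
    return total / n
```

Each pair contributes its own independent term, so no normalization across the batch is needed. That independence is what makes the loss less dependent on batch size than a softmax over the full row.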
LLaVA and its successors: Stitching encoders and LLMs
The lecture then moves to Vision-Language Models (VLMs), focusing on the LLaVA and Qwen families. LLaVA (2023) exemplifies a common VLM architecture: it takes a pre-trained vision encoder (CLIP) and a pre-trained language model (e.g., Vicuna) and connects them using a learned projection matrix (adapter). This 'stitching' approach involves a two-stage training process: first, aligning the image embedding space to the text embedding space by training only the adapter, and second, fine-tuning the adapter and the language model on synthesized visual-reasoning data. LLaVA-1.5 and LLaVA-NeXT (2024) advanced this further by incorporating multiple images and videos, introducing the 'any-resolution' (any-res) approach to handle variable image sizes by tiling and encoding image crops, and upgrading to SigLIP as the vision encoder.
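The stitching step can be sketched as a linear map from vision features into the language model's embedding space, followed by splicing the result into the text sequence. This is a minimal sketch in plain Python: all names here (`project`, `build_input_sequence`, `TRAINABLE_BY_STAGE`) are hypothetical, the dimensions are toy values, and a single linear layer is assumed for the adapter.

```python
# Which components receive gradient updates in each stage, as described above.
# The vision encoder stays frozen in both stages in this sketch.
TRAINABLE_BY_STAGE = {
    "stage1_alignment": {"adapter"},
    "stage2_finetune": {"adapter", "language_model"},
}

def project(patch_features, weight, bias):
    """Linear adapter: map each vision feature (d_vision) to d_text.

    weight has shape d_vision x d_text, bias has length d_text.
    """
    columns = list(zip(*weight))  # one tuple of length d_vision per output dim
    return [
        [sum(f * w for f, w in zip(feature, column)) + b
         for column, b in zip(columns, bias)]
        for feature in patch_features
    ]

def build_input_sequence(text_before, visual_tokens, text_after):
    """Splice projected image tokens between the surrounding text embeddings."""
    return text_before + visual_tokens + text_after
```

Once projected, the image tokens are ordinary vectors in the language model's input space, so the language model itself needs no architectural change.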
Qwen models: Dynamic resolution and multimodal RoPE
The Qwen series of models, starting with Qwen-VL, follows a similar template but introduces distinct innovations. Qwen2-VL, for instance, adopted dynamic resolution handling, a crucial step for multimodal inputs, and introduced multimodal rotary positional embeddings (M-RoPE). This positional encoding method extends rotary embeddings to three dimensions (time, height, and width), allowing the model to represent spatial and temporal relationships more effectively. Qwen3-VL refined this by interleaving the dimensions in the RoPE embeddings so that every axis is exposed to both low and high frequencies, and introduced explicit video timestamps as tokens, enabling more direct temporal reasoning. These models also focus on long-context understanding, with Qwen3-VL reaching up to 256K tokens, and employ sophisticated multi-stage training pipelines.
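One way to picture three-dimensional positions is to assign every token a (time, height, width) triple instead of a single index. The sketch below shows only that ID assignment, not the rotation applied inside attention. It is a simplified reading under stated assumptions: the function name is hypothetical, text tokens are assumed to share one index across all three axes, visual IDs are assumed to be offset by the position where the visual block starts, and all grid sizes are assumed to be at least 1.

```python
def mrope_position_ids(n_text_before, frames, height, width, n_text_after):
    """Return one (time, height, width) position triple per token."""
    # Text tokens: the same index on all three axes, like ordinary 1D positions.
    ids = [(p, p, p) for p in range(n_text_before)]
    start = n_text_before
    # Visual tokens: one triple per patch, in frame-major, row-major order.
    for t in range(frames):
        for h in range(height):
            for w in range(width):
                ids.append((start + t, start + h, start + w))
    # Text after the visual block resumes just past the largest ID used so far.
    next_pos = start + max(frames, height, width)
    ids += [(p, p, p) for p in range(next_pos, next_pos + n_text_after)]
    return ids
```

For example, two text tokens, a single 2 x 2 image, and one trailing text token yield the triples (0,0,0), (1,1,1), (2,2,2), (2,2,3), (2,3,2), (2,3,3), (4,4,4). The four image patches share a time index and differ only in height and width.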
Chameleon: Towards a discretely-tokenized omnimodel
Chameleon from Meta (2024) presents a different paradigm by aiming to map all modalities into discrete tokens, similar to text. It utilizes Vector Quantized Variational Autoencoders (VQ-VAEs) to convert images into sequences of discrete codes from a learned codebook. This allows for end-to-end training of a single language model on mixed text and image tokens. While aesthetically elegant and simplifying the training architecture, Chameleon faces challenges with training stability due to the inherent differences in entropy between text and image tokens. The discretization process also leads to information loss, particularly detrimental for tasks requiring fine-grained detail like OCR. Despite its innovative approach, the authors note that diffusion models currently offer better performance for generation tasks, and this VQ-VAE based approach is less popular for that reason.
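The quantization step at the core of a VQ-VAE can be sketched as a nearest-neighbor lookup into the codebook. This is a minimal illustration, not Chameleon's tokenizer: the function names are hypothetical, the codebook is a hand-written toy rather than a learned one, and the encoder and decoder networks that surround this step are omitted.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(vectors, codebook):
    """Replace each continuous vector with the index of its nearest code."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(v, codebook[k]))
        for v in vectors
    ]

def dequantize(codes, codebook):
    """Look the codes back up; the result is only an approximation of the input."""
    return [list(codebook[k]) for k in codes]

# Toy usage: three 2D "patch features" and a three-entry codebook (assumed values).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
features = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.9]]
codes = quantize(features, codebook)
reconstruction = dequantize(codes, codebook)
```

The integer codes can be fed to a language model like text token IDs. The gap between `features` and `reconstruction` is the information lost to discretization, which is the loss that hurts detail-sensitive tasks such as OCR.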
Key challenges and future directions in multimodality
The lecture concludes by summarizing the core challenges and future directions in multimodal AI. A fundamental issue is effectively handling non-text modalities and balancing their different information densities to avoid overwhelming the model (e.g., video frames vs. text tokens). The choice of encoder is critical: high-level semantics for classification might be captured by smaller, well-aligned continuous encoders like CLIP, while tasks requiring fine detail, such as OCR or image generation, necessitate fine-grained representations, where diffusion models excel. The trend is towards increasingly complex, multi-stage training pipelines and larger context windows. The ultimate goal remains the development of natively multimodal or omnimodels, likely leveraging a combination of continuous encoders for rich representation and powerful generative models like diffusion, though the exact architectures of state-of-the-art closed-source models like Gemini and GPT-4 remain proprietary.
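The information-density imbalance is easy to see with back-of-the-envelope arithmetic. The sketch below assumes a patch-based encoder that emits one token per non-overlapping square patch; the function name and the specific sizes (224-pixel frames, 14-pixel patches, 64 frames) are illustrative assumptions, not figures from the lecture.

```python
def visual_token_count(height, width, patch_size, frames=1):
    """Tokens produced if each non-overlapping patch becomes one token."""
    rows = height // patch_size
    cols = width // patch_size
    return frames * rows * cols

# One 224 x 224 image with 14-pixel patches: a 16 x 16 grid of tokens.
tokens_per_image = visual_token_count(224, 224, 14)
# A 64-frame clip at the same resolution, with no temporal compression.
tokens_per_clip = visual_token_count(224, 224, 14, frames=64)
```

Under these assumptions a single image costs 256 tokens and a short clip costs 16,384, far more than a typical text prompt. This is why frame sampling, token merging, and long context windows matter for video.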
Common Questions
Multimodality refers to AI models that can process and understand information from multiple types of data, such as text, images, audio, and video, going beyond just text-based inputs.
Mentioned in this video
Conversations with ChatGPT were used to synthesize data for training LLaVA models.
Used to synthesize training data for LLaVA by generating questions and conversations based on image captions and objects.
An evolution of LLaVA that improved handling of multiple images and videos, utilizing SigLIP as the vision encoder.
A type of neural network architecture used for image classification; CLIP outperformed a ResNet trained on a large dataset in zero-shot evaluation.
A transformer-based architecture adapted for vision tasks, found to perform best in CLIP models.
Mentioned as a precursor to foundation models in the language domain before CLIP's development in vision.
An open-source replication and extension of CLIP, utilizing the LAION-5B dataset.
A family of vision-language models that inject image embeddings into a language model; LLaVA 1.5 can handle multiple images and videos.
The foundational language model used in early LLaVA versions, fine-tuned on shared GPT conversations.
Mentioned as a contemporary multimodal frontier model known for its capabilities across various modalities.