How does the CLIP model work?

CLIP (Contrastive Language-Image Pre-training) works by encoding images and text into embeddings and then training a model to maximize the similarity (dot product) between embeddings of corresponding image-text pairs, while minimizing similarity with unrelated pairs.

What are Vision Transformers (ViTs) used for?

ViTs are transformer architectures adapted for image processing. They break images into patches, treat them as tokens, and process them through a standard transformer, forming the basis of the vision encoder in models like CLIP.
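The patch-splitting step can be illustrated with a small sketch. It assumes a single-channel image stored as a list of rows and a hypothetical helper named `patchify`; a real ViT would also project each flattened patch with a learned linear layer and add position embeddings, which are omitted here.

```python
def patchify(image, patch_size):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order; each one is a flat list of
    patch_size * patch_size pixel values and plays the role of one "token".
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4x4 single-channel "image" whose pixel values are 0..15 (illustrative data).
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = patchify(image, patch_size=2)
print(len(patches), patches[0])
```

With a 2x2 patch size the 4x4 toy image becomes four tokens of four values each, so the sequence length grows with the square of the image side.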

What was the significance of CLIP's zero-shot performance?

CLIP achieved impressive performance on benchmarks like ImageNet without specific training for those tasks, demonstrating its ability to generalize semantic understanding across image and text modalities.

How does SigLIP improve upon CLIP?

SigLIP simplifies the training objective by using a binary classification (aligned or not aligned) with a sigmoid loss, making it more efficient and easier to train than CLIP's multiclass classification approach.

What is LLaVA and how does it integrate vision and language?

LLaVA (Large Language and Vision Assistant) models combine a pre-trained vision encoder (like CLIP) with a language model. Image embeddings are projected into the language model's space, allowing it to process and generate text based on visual input.

What advancements did LLaVA 1.5 bring?

LLaVA 1.5 improved upon earlier versions by handling multiple images and videos, using a more advanced vision encoder (SigLIP), and employing a two-layer MLP projector for better feature mapping.

How do Qwen models differ from LLaVA?

Qwen models often train the vision encoder alongside the adapter in early stages, and across their versions they have explored techniques such as dynamic resolution and multimodal rotary positional embeddings (M-RoPE).

What is Chameleon's approach to multimodality?

Chameleon attempts to unify different modalities by mapping them all into discrete tokens, enabling a single language model to process and generate text and images, though with potential stability and information loss issues.

What are the main challenges in building multimodal models?

Key challenges include effectively handling non-text modalities, balancing information density across different data types (like video vs. text), and ensuring fine-grained detail preservation, especially for tasks like OCR or image generation.

Key Moments

Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 17: Alignment - Multimodality

Stanford Online

Education | 5 min read | 78 min video

Jun 4, 2026 | 3,173 views

TL;DR

Multimodal AI models can now process and generate combinations of text, images, and video by treating all data as 'tokens', but this discretization sacrifices fine-grained detail crucial for tasks like OCR.

Key Insights

The ultimate goal in multimodal AI is an 'omnimodel' capable of processing and generating any combination of text, images, audio, and video.

CLIP, a foundational multimodal model, uses contrastive language-image pre-training to align image and text embeddings. In the zero-shot setting, its ImageNet accuracy matches that of a ResNet-50 trained on roughly 1.28 million labeled ImageNet images.

SigLIP improved upon CLIP by using a sigmoid loss for binary classification of image-text pairs, enabling more efficient training and reducing reliance on large batch sizes.

LLaVA, a vision-language model (VLM), stitches together off-the-shelf image encoders (like CLIP) and language models, using a two-stage training process: adapter-only 'alignment' followed by fine-tuning. Later versions (LLaVA 1.5, LLaVA-NeXT) incorporated multiple images and video with 'any-resolution' (any-res) processing.

Qwen2-VL introduced dynamic resolution and multimodal rotary positional embeddings (M-RoPE) for handling varying image sizes and temporal data in videos, supporting up to 16,000 tokens for video sequences.

Chameleon offers a novel approach by mapping all modalities, including images, into discrete tokens using Vector Quantized Variational Autoencoders (VQ-VAEs), enabling end-to-end language model training but facing challenges with training stability and information loss due to discretization.

The quest for an omnimodel and tokenization challenges

The lecture begins by introducing the concept of 'omnimodels' – AI systems capable of processing and generating any combination of modalities: text, images, audio, and video. This contrasts with current language models, which are primarily text-based. The core challenge in building such models lies in converting diverse data types into a format that transformers, which operate on 'tokens,' can process. While text tokenization is relatively straightforward (e.g., using BPE), non-text modalities like images and audio require more complex methods to be represented as discrete or continuous tokens that capture meaningful semantic units, rather than raw pixels.

CLIP: Aligning images and text through contrastive learning

The CLIP (Contrastive Language-Image Pre-training) model, developed by OpenAI, is presented as a foundational step in multimodal AI. Inspired by the success of large-scale language models trained on internet text, CLIP leverages vast amounts of image-text pairs scraped from the web. Its objective is to learn embeddings where corresponding image and text pairs are close in the embedding space, while non-corresponding pairs are distant. CLIP uses a vision transformer (ViT) as its image encoder and a GPT-2 style transformer for text. For a batch of N pairs, training applies a contrastive loss to the N x N matrix of image-text similarity scores, treating the N diagonal entries as the correct matches. A key result highlighted is CLIP's zero-shot performance on ImageNet, which matches a ResNet-50 trained on roughly 1.28 million labeled images and demonstrates the power of learning from naturally occurring web data.
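The contrastive objective on the N x N similarity matrix can be sketched in a few lines. This is a dependency-free illustration rather than OpenAI's implementation: the function names are hypothetical, the embeddings are toy vectors, and the fixed `temperature=0.07` is an assumption (CLIP learns this scale during training).

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one row of logits (numerically stable)."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))
    return log_sum_exp - logits[target_index]


def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: pair i is the correct match for row i and column i."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] is the scaled cosine similarity between image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image-to-text: each row is an N-way classification over the texts.
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image: each column is an N-way classification over the images.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_image = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (image_to_text + text_to_image) / 2


# Toy batch: matched pairs point the same way; the shuffled batch mismatches them.
aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

Because every row and column is normalized by a softmax over the whole batch, the other N - 1 items in the batch act as the negatives for each pair.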

SigLIP: Enhancing efficiency with a simpler loss

SigLIP (Sigmoid Loss for Language Image Pre-training) is introduced as an improvement over CLIP, addressing its reliance on large batch sizes and a complex multiclass classification loss. SigLIP simplifies the objective to a binary classification task for each image-text pair: are they aligned or not? This is achieved using a sigmoid loss. Because each pair contributes an independent term, with no softmax normalization across the batch, the loss is more computationally efficient and less sensitive to batch size. The paper also details significant training speedups, reducing training time from 10 days on 256 TPU v3s for CLIP to 5 days on 32 TPU v4s for SigLIP, showcasing advancements in distributed training strategies for multimodal models.
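The pairwise sigmoid objective can be sketched as follows. This is a minimal, dependency-free illustration with hypothetical function names and toy embeddings; the fixed `scale=10.0` and `bias=-10.0` are assumptions standing in for parameters that are learned during training.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_pair_loss(image_embs, text_embs, scale=10.0, bias=-10.0):
    """Binary loss over all N*N pairs: label +1 on the diagonal, -1 elsewhere."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(images):
        for j, txt in enumerate(texts):
            logit = scale * sum(a * b for a, b in zip(img, txt)) + bias
            label = 1.0 if i == j else -1.0
            # Each pair is scored on its own; no softmax across the batch.
            total += -log_sigmoid(label * logit)
    return total / len(images)


# Toy batch: matched pairs versus deliberately mismatched pairs.
aligned = sigmoid_pair_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = sigmoid_pair_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

Since no term depends on a batch-wide normalizer, the pairs can be processed in chunks across devices, which is one way to read the reduced dependence on batch size described above.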

LLaVA and its successors: Stitching encoders and LLMs

The lecture then moves to Vision-Language Models (VLMs), focusing on the LLaVA and Qwen families. LLaVA (2023) exemplifies a common VLM architecture: it takes a pre-trained vision encoder (CLIP) and a pre-trained language model (e.g., Vicuna) and connects them using a learned projection matrix (adapter). This 'stitching' approach involves a two-stage training process: first, aligning the image embedding space to the text embedding space by training only the adapter, and second, fine-tuning the adapter and the language model on synthesized visual-reasoning data. LLaVA 1.5 and LLaVA-NeXT (2024) further advanced this by incorporating multiple images and videos, introducing the 'any-resolution' (any-res) approach to handle variable image sizes by tiling and encoding image crops, and upgrading to SigLIP as the vision encoder.
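The adapter step can be sketched as a single linear projection followed by concatenation with the text embeddings. This is a toy illustration under stated assumptions: the function names, dimensions, and weights are hypothetical, and both the vision encoder and the language model are replaced by hand-written vectors.

```python
def project(patch_embeddings, weight):
    """Map each vision embedding (length d_vision) into the LM space (length d_model).

    weight has shape d_vision x d_model; this is the learned adapter.
    """
    return [
        [sum(x * w for x, w in zip(emb, column)) for column in zip(*weight)]
        for emb in patch_embeddings
    ]


def build_input_sequence(patch_embeddings, text_embeddings, weight):
    """Prepend projected image tokens to the text token embeddings."""
    image_tokens = project(patch_embeddings, weight)
    return image_tokens + text_embeddings


# Hypothetical sizes: d_vision = 3, d_model = 2.
adapter_weight = [[1, 0], [0, 1], [1, 1]]
vision_output = [[1, 2, 3], [0, 1, 0]]   # two patch embeddings from the encoder
text_embeddings = [[0.5, 0.5]]           # one text token embedding

sequence = build_input_sequence(vision_output, text_embeddings, adapter_weight)
print(sequence)

# Stage 1 (alignment) would update only adapter_weight.
# Stage 2 (fine-tuning) would update adapter_weight and the language model.
```

After projection the language model sees image patches as ordinary positions in its input sequence, which is why an off-the-shelf LLM can be reused with little architectural change.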

Qwen models: Dynamic resolution and M-RoPE

The Qwen series of models, starting with Qwen-VL, follows a similar template but introduces distinct innovations. Qwen2-VL, for instance, adopted dynamic resolution handling, a crucial step for multimodal inputs, and introduced 'multimodal rotary positional embeddings' (M-RoPE). This positional encoding method extends the concept of rotary embeddings to three dimensions: height, width, and time, allowing the model to understand spatial and temporal relationships more effectively. Qwen3-VL further refined this by interleaving dimensions in the rotary embeddings to expose all axes to low and high frequencies and introduced explicit video timestamps as tokens, enabling more direct temporal reasoning. These models also focus on long context understanding, with Qwen3-VL reaching up to 256K tokens, and employ sophisticated training pipelines involving multiple stages.
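One way to picture three-axis positions is to assign every token a (time, height, width) triple. The sketch below is an assumption-laden illustration, not the Qwen implementation: the function name and segment format are hypothetical, text tokens reuse one index on all three axes, and each new segment starts one past the largest index used so far. It produces only the position ids; the rotary rotation itself is not shown.

```python
def multimodal_position_ids(segments):
    """Return one (time, height, width) position triple per token.

    segments is a list of ("text", length) or ("vision", frames, height, width),
    where height and width count patches per frame.
    """
    positions = []
    next_start = 0
    for segment in segments:
        if segment[0] == "text":
            _, length = segment
            for i in range(length):
                # Text tokens carry the same index on all three axes.
                positions.append((next_start + i,) * 3)
            next_start += length
        else:
            _, frames, height, width = segment
            for t in range(frames):
                for h in range(height):
                    for w in range(width):
                        positions.append(
                            (next_start + t, next_start + h, next_start + w)
                        )
            # The next segment starts after the largest index used on any axis.
            next_start += max(frames, height, width)
    return positions


# Two text tokens, a 2-frame video with a 2x2 patch grid, then one text token.
ids = multimodal_position_ids([("text", 2), ("vision", 2, 2, 2), ("text", 1)])
print(len(ids), ids[2], ids[-1])
```

In this toy layout the eight video tokens advance the position counter by only two, which hints at why a multi-axis scheme suits long videos.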

Chameleon: Towards a discretely-tokenized omnimodel

Chameleon from Meta (2024) presents a different paradigm by aiming to map all modalities into discrete tokens, similar to text. It utilizes Vector Quantized Variational Autoencoders (VQ-VAEs) to convert images into sequences of discrete codes from a learned codebook. This allows for end-to-end training of a single language model on mixed text and image tokens. While aesthetically elegant and simplifying the training architecture, Chameleon faces challenges with training stability due to the inherent differences in entropy between text and image tokens. The discretization process also leads to information loss, particularly detrimental for tasks requiring fine-grained detail like OCR. Despite its innovative approach, the authors note that diffusion models currently offer better performance for generation tasks, and this VQ-VAE based approach is less popular for that reason.
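The discretization step and its information loss can be illustrated with a nearest-codebook lookup. This sketch covers only the quantization step under stated assumptions: the function names, the three-entry codebook, and the input vectors are hypothetical, and the learned encoder and decoder of a real VQ-VAE are omitted.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vectors, codebook):
    """Replace each continuous vector with the id of its nearest codebook entry."""
    token_ids = [
        min(range(len(codebook)), key=lambda k: squared_distance(v, codebook[k]))
        for v in vectors
    ]
    reconstructed = [codebook[k] for k in token_ids]
    return token_ids, reconstructed


# Hypothetical 3-entry codebook and three encoder outputs for image patches.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
encoder_outputs = [[0.1, -0.2], [0.9, 1.2], [0.2, 0.8]]

token_ids, reconstructed = quantize(encoder_outputs, codebook)
# Whatever the codebook cannot represent is lost in the round trip.
reconstruction_error = sum(
    squared_distance(v, r) for v, r in zip(encoder_outputs, reconstructed)
)
print(token_ids, reconstruction_error)
```

The token ids can be fed to a language model like text tokens, while the nonzero reconstruction error is the kind of detail loss that matters for fine-grained tasks such as OCR.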

Key challenges and future directions in multimodality

The lecture concludes by summarizing the core challenges and future directions in multimodal AI. A fundamental issue is effectively handling non-text modalities and balancing their different information densities to avoid overwhelming the model (e.g., video frames vs. text tokens). The choice of encoder is critical: high-level semantics for classification might be captured by smaller, well-aligned continuous encoders like CLIP, while tasks requiring fine detail, such as OCR or image generation, necessitate fine-grained representations, where diffusion models excel. The trend is towards increasingly complex, multi-stage training pipelines and larger context windows. The ultimate goal remains the development of natively multimodal or omnimodels, likely leveraging a combination of continuous encoders for rich representation and powerful generative models like diffusion, though the exact architectures of state-of-the-art closed-source models like Gemini and GPT-4 remain proprietary.

Common Questions

Multimodality refers to AI models that can process and understand information from multiple types of data, such as text, images, audio, and video, going beyond just text-based inputs.

Topics

AI & Machine Learning, Technology & Innovation, Science & Mathematics, Large Language Models, Multimodal AI, Computer Vision, Vision-language Models, Contrastive Learning, Transformer Architectures, AI Model Training

Mentioned in this video

Software & Apps

ChatGPT

Conversations with ChatGPT were used to synthesize data for training LLaVA models.

GPT-4

Used to synthesize training data for LLaVA by generating questions and conversations based on image captions and objects.

LLaVA 1.5

An evolution of LLaVA that improved handling of multiple images and videos, utilizing SigLIP as the vision encoder.

ResNet

A type of neural network architecture used for image classification; CLIP's zero-shot ImageNet accuracy matched that of a ResNet-50 trained on the labeled ImageNet training set.

ViT

A transformer-based architecture adapted for vision tasks, found to perform best in CLIP models.

GPT-3

Mentioned as a precursor to foundation models in the language domain before CLIP's development in vision.

OpenCLIP

An open-source replication and extension of CLIP, utilizing the LAION-5B dataset.

LLaVA

A family of vision-language models that inject image embeddings into a language model; LLaVA 1.5 can handle multiple images and videos.

Llama

The base language model behind Vicuna, the LLM used in early LLaVA versions; Vicuna is Llama fine-tuned on ShareGPT conversations.

Gemini

Mentioned as a contemporary multimodal frontier model known for its capabilities across various modalities.

Products

Chameleon

A model from Meta that maps images into discrete tokens, allowing for unified analysis and generation across modalities within a single language model framework.

Companies

OpenAI

Researchers at OpenAI developed the CLIP model, leveraging large amounts of image and textual captions from the internet.
