What are the essential steps for building a new image or video generation model?

The core steps involve generating synthetic text-image/video pairs through detailed human or VLM captioning, training a VAE to compress visual data into a latent space, and then training a diffusion transformer model on these latent and language tokens, often bootstrapping video models from pre-trained image models.

What is the true cost of training large video models?

Training large video models is comparable in cost to medium-scale language models. Significant expenses come from storing petabytes of video data and their compressed features (hundreds of thousands of dollars per month for storage and egress) in addition to high GPU compute costs, making data loading and caching optimizations crucial.

How can inference be made more efficient for diffusion models?

Efficiency gains in inference primarily come from distillation techniques like 'step distillation'. This involves teaching a smaller model to generate high-quality outputs in fewer steps (e.g., 4-8 steps instead of 100 or 1000) by learning from a larger, slower 'teacher' model.

What are the main challenges in audio-video joint generation?

Key challenges include modality alignment (precisely connecting audio and video tokens over time), the continuous nature of music (which is hard to model with discrete tokens), and the difficulty of gathering detailed synthetic captioning data for audio, as current LLMs are poor at describing musical nuances.

What defines a 'world model' in the context of video AI?

A world model, in Ethan He's definition, is characterized by real-time, interactive, and long-horizon video generation. It should allow interaction via various modalities, respond instantly to user input, and generate content spanning minutes or hours while maintaining consistency.

How does xAI Grok Imagine address long-range consistency in videos?

Grok Imagine's 'video extension' feature incorporates historical context from previous generated videos to inform subsequent frames, maintaining consistency for characters, objects, and dialogues over longer durations, unlike naive methods that only use the last frame.

How do language models contribute to visual intelligence in video generation?

Visual intelligence in video models largely stems from language models, which act as 'prompt rewriters' or 'up-samplers'. They take simple user instructions and expand them into extremely detailed descriptions that 'dumb' video diffusion models can literally follow, enhancing the quality and relevance of generated content.

What is the vision for video agents and their future?

Video agents are envisioned as advanced language models that can 'call' generative models (either as separate tools or integrated heads) to iteratively refine video generation. They can also integrate traditional editing tools (like Photoshop or FFmpeg) and manage complex, long-form video creation, eventually leading to fully automated professional-grade video production.

Why did Ethan He leave xAI to focus on language models?

Ethan He left xAI due to a desire to pursue research areas, particularly on the language model side, that were not aligned with company priorities. He realized that the primary gains in advanced video models were increasingly coming from improvements in language models and agentic capabilities, rather than core diffusion technology.

What are the next big developments expected in language models?

The next significant advancements in language models are predicted to involve becoming context-aware and managing their own context. This includes features like automatic context compaction, removal, and addition, potentially leading to models that implicitly understand the length of their context and can self-modify their agentic harnesses.

Inside xAI: Building Grok Imagine in 3 Months, Videogen vs World Models, and Video Agents - Ethan He

Latent Space Podcast

Science & Technology | 6 min read | 105 min video

Jun 1, 2026 | 1,695 views

TL;DR

The core intelligence in advanced video generation models appears to stem from language models, not the video diffusion components themselves, suggesting a shift in focus for future AI development towards enhancing LLM capabilities.

Key Insights

The majority of improvements in current video generation models like Grok Imagine come from advances in language models, rather than the core video diffusion technology.

Building a state-of-the-art video generation model like Grok Imagine from scratch took approximately three months, heavily relying on a strong, cohesive engineering team and robust infrastructure.

Storing and moving massive video datasets for training can cost millions of dollars per month due to data size and egress fees, rivaling GPU compute costs.

Video agents, which can iteratively refine results and leverage various tools (including other generative models), are seen as the next frontier, moving beyond simple frame generation to production-grade content creation.

The complex alignment of audio, video, and text modalities is a significant challenge in multimodal AI, particularly with audio's continuous and discrete components.

Real-time, interactive, long-horizon video generation is the ultimate goal of 'world models,' enabling complex interactions like gaming or navigating generated virtual environments.

Language as the primary driver of visual intelligence

The episode's central claim is that current gains in visual intelligence for video generation, particularly in systems built on mature diffusion technology, are driven primarily by the underlying language models rather than by the video diffusion mechanisms themselves. Ethan He explains that in systems like Cosmos, much of the 'thinking' and refinement happens in the prompt rewriting and upsampling components, which are often larger and more sophisticated language models. These language models translate user instructions into detailed descriptions that the 'dumber,' more literal video diffusion models can then execute. For models like Grok Imagine (7B parameters), this implies that the LLM component (which could be larger) plays a crucial role in expanding simple prompts into complex, detailed visual descriptions, and that future gains in video generation may hinge more on LLM advancements than on diffusion architecture improvements alone.
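The prompt-rewriting flow described above can be sketched in a few lines. This is a minimal illustration, not how Cosmos or Grok Imagine is implemented: `generate_with_rewriter`, `REWRITE_INSTRUCTIONS`, and the two stand-in models are hypothetical names, and the LLM and video model are passed in as plain callables so that no real API is assumed.

```python
from typing import Callable

# Hypothetical system instructions for the rewriter LLM (an assumption for illustration).
REWRITE_INSTRUCTIONS = (
    "Expand the user's request into a detailed shot description covering "
    "subject, setting, lighting, camera motion, and timing."
)

def generate_with_rewriter(
    user_prompt: str,
    rewrite_llm: Callable[[str], str],    # hypothetical: text in, text out
    video_model: Callable[[str], bytes],  # hypothetical: prompt in, encoded video out
) -> tuple[str, bytes]:
    """Expand a short prompt with an LLM, then hand the result to the video model."""
    detailed_prompt = rewrite_llm(
        f"{REWRITE_INSTRUCTIONS}\n\nUser request: {user_prompt}"
    )
    return detailed_prompt, video_model(detailed_prompt)

# Stand-ins so the sketch runs without any real model.
def fake_llm(text: str) -> str:
    request = text.rsplit("User request: ", 1)[-1]
    return f"{request}, golden-hour light, slow dolly-in, shallow depth of field"

def fake_video_model(prompt: str) -> bytes:
    return prompt.encode("utf-8")

detailed, video = generate_with_rewriter(
    "a cat on a skateboard", fake_llm, fake_video_model
)
print(detailed)
```

The point of the structure is that the diffusion model only ever sees the expanded prompt, so improving the rewriter changes output quality without touching the video model.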

Rapid development of frontier models from scratch

The genesis of xAI's Grok Imagine shows that complex generative models can be built quickly. Ethan He recounts that building the first version, Grok Imagine 0.9, from 'zero to one' took only three months. He attributes the accelerated timeline to a team of exceptionally talented, close-knit engineers who could work efficiently towards a common goal with minimal communication overhead. Strong foundational infrastructure at xAI, encompassing data pipelines, inference capabilities, and compute resources, was also critical. That infrastructure enabled a rapid iteration cycle, with faster training runs and quicker identification of bugs, underscoring the importance of both human talent and a well-prepared technical environment for frontier AI development.

The immense cost and complexity of video data handling

Storing and managing the vast datasets required for training video models presents a significant financial and logistical challenge. By the estimate given, storing a billion videos of around five megabytes each consumes five petabytes, which costs upwards of $100,000 per month on cloud storage such as AWS S3. The cost escalates once the compressed latent representations produced by the VAE and the egress fees for downloading data are included, potentially reaching millions of dollars per month. Beyond storage, the sheer volume of data movement (I/O throughput and IOPS) can leave training 'IO-bound,' with GPUs waiting on reads instead of computing. Optimizations such as data loading and caching are crucial, but at this scale the infrastructure costs of storage and data transfer are comparable to, and can exceed, the cost of GPU compute hours.
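The storage estimate above can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes decimal units (1 PB = 1,000,000 GB) and an illustrative object-storage price of $0.021 per GB-month; that price is an assumption chosen for the example, not a quoted rate, and real pricing varies by tier, region, and volume. Latent storage and egress are not included.

```python
def monthly_storage_cost(num_videos, mb_per_video, usd_per_gb_month):
    """Return (petabytes stored, USD per month) for a video dataset."""
    total_gb = num_videos * mb_per_video / 1_000   # MB -> GB (decimal units)
    total_pb = total_gb / 1_000_000                # GB -> PB
    return total_pb, total_gb * usd_per_gb_month

# Assumed illustrative price, not an official rate.
ASSUMED_USD_PER_GB_MONTH = 0.021

petabytes, usd = monthly_storage_cost(
    num_videos=1_000_000_000,
    mb_per_video=5,
    usd_per_gb_month=ASSUMED_USD_PER_GB_MONTH,
)
print(f"{petabytes:.1f} PB, about ${usd:,.0f} per month")
# prints: 5.0 PB, about $105,000 per month
```

Under these assumptions the raw videos alone land just above the $100,000 per month figure, before any latents or data transfer are counted.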

Video agents: The next logical step in generative AI

Generative AI is moving towards 'video agents': systems that go beyond generating static sequences of frames. These agents are envisioned as more sophisticated entities that can iteratively refine results, manage context, and use a suite of tools, including diffusion models, traditional editing software (like FFmpeg), and even other generative models. This approach mirrors human creative processes, where raw generated content is post-processed and edited to achieve production-grade quality. The Grok Imagine video extension and agent beta are early steps in this direction, allowing for longer-form content creation by understanding historical context and enabling interactive editing. The future holds agents that can self-modify their harnesses, program themselves at test time, and leverage LLMs to intelligently prompt and orchestrate various generative and editing tools to create complex, polished video content.
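The agent pattern described above is essentially a plan, act, observe loop around a set of tools. The following is a minimal sketch under stated assumptions: `run_video_agent`, the planner signature, and the tool names `generate_clip` and `stitch` are all hypothetical, and the planner is a scripted stand-in for an LLM. It does not describe the Grok Imagine agent beta.

```python
from typing import Callable, Dict, List

# Hypothetical tool signature: instruction in, artifact description out.
Tool = Callable[[str], str]

def run_video_agent(
    goal: str,
    plan_next: Callable[[str, List[str]], Dict[str, str]],  # hypothetical LLM planner
    tools: Dict[str, Tool],
    max_steps: int = 8,
) -> List[str]:
    """Ask the planner for a tool call, run it, record the result, repeat until 'done'."""
    history: List[str] = []
    for _ in range(max_steps):
        action = plan_next(goal, history)
        if action["tool"] == "done":
            break
        result = tools[action["tool"]](action["input"])
        history.append(f"{action['tool']}({action['input']}) -> {result}")
    return history

# Scripted stand-in for an LLM planner: two generated shots, then a stitch.
def scripted_planner(goal: str, history: List[str]) -> Dict[str, str]:
    script = [
        {"tool": "generate_clip", "input": "shot 1: " + goal},
        {"tool": "generate_clip", "input": "shot 2: " + goal},
        {"tool": "stitch", "input": "clip_1 clip_2"},
        {"tool": "done", "input": ""},
    ]
    return script[len(history)]

# Stand-in tools; a real agent might wrap a diffusion model and an editing tool here.
tools: Dict[str, Tool] = {
    "generate_clip": lambda instruction: f"clip[{instruction}]",
    "stitch": lambda instruction: f"final[{instruction}]",
}

history = run_video_agent("a sunrise over a harbor", scripted_planner, tools)
for line in history:
    print(line)
```

Because the planner sees the full history on every step, it can decide to regenerate a shot or call an editing tool instead of a generative one, which is the iterative refinement the section describes.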

World models: Enabling real-time, interactive, long-horizon experiences

The concept of 'world models' represents the ultimate frontier in real-time interactive video generation. Ethan He defines these models by three core characteristics: interactivity, real-time responsiveness, and long-horizon generation. This translates to systems where users can interact via keyboard, mouse, or voice, and the model responds instantaneously (ideally within milliseconds for gaming, or a more generous 200ms for digital humans). Crucially, these models must also generate content that extends over minutes or hours, not just seconds. Examples like Flipbook and Neuro OS, which simulate interactive web browsers or operating systems with generated UIs, showcase early steps towards this vision. Achieving this requires overcoming significant challenges in managing context windows and temporal compression without introducing lag, enabling AI to create dynamic, responsive virtual environments.

Multimodal alignment: The challenge of integrating diverse data types

Integrating different data modalities like text, images, audio, and video presents a substantial hurdle in AI development. While text-to-image and text-to-video alignments are becoming more robust, incorporating audio remains particularly challenging. Audio has both discrete components (like speech, which can be represented as text tokens with some additional characteristics) and continuous components (like music), which are difficult to model within traditional discrete token frameworks. Furthermore, precise temporal alignment between modalities, meaning knowing exactly which audio corresponds to which video frame at a given time step, is hard to achieve because such alignment information is not naturally present in most internet data. Generating synthetic data and creating models that can accurately capture nuances like musical beats, tone, and dialogue, while maintaining cross-modal consistency, is an active area of research.
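At the lowest level, temporal alignment means mapping each video frame to the slice of audio that plays during it. The sketch below shows only that index arithmetic, assuming 24 frames per second and 48 kHz audio; both values and the function name `audio_span_for_frame` are assumptions for illustration, and real systems align compressed latent tokens instead of raw samples.

```python
def audio_span_for_frame(frame_index, fps=24, sample_rate=48_000):
    """Return the [start, end) range of audio samples that plays during one video frame."""
    # Integer arithmetic avoids drift from accumulating a fractional samples-per-frame value.
    start = frame_index * sample_rate // fps
    end = (frame_index + 1) * sample_rate // fps
    return start, end

print(audio_span_for_frame(0))    # (0, 2000)
print(audio_span_for_frame(48))   # (96000, 98000), i.e. the frame at t = 2 s
```

With these assumed rates every frame covers exactly 2,000 samples; with rates that do not divide evenly, spans differ by one sample from frame to frame, which is one small reason alignment has to be handled explicitly.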

Efficiency, distillation, and the path to faster inference

Reducing computational cost, particularly for inference, is critical for deploying advanced generative models. Techniques like 'step distillation' are key: a smaller, faster student model learns to mimic the output of a larger, more complex teacher model in fewer steps. For instance, a model trained to generate video in 10 steps can learn from a 100-step model, which simplifies the target distribution from the entire internet's complexity to just the teacher model's output. This strong-to-weak learning paradigm is related to adversarial training in the style of Generative Adversarial Networks (GANs), where a discriminator provides feedback on outputs produced in a single step. Consistency models and other distillation methods aim to achieve production-level quality with significantly fewer computational steps, making real-time applications and widespread deployment more feasible.
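A toy example can make the step-distillation idea concrete. The sketch below is not a diffusion model: it assumes a one-dimensional ODE, dx/dt = -x, as a stand-in for a sampler, treats 100 small Euler steps as the teacher, and fits a student whose single step `x -> a * x` reproduces ten teacher steps at once. All function names are hypothetical.

```python
import random

def euler_sample(x, steps):
    """Integrate the toy ODE dx/dt = -x over t in [0, 1] with `steps` Euler updates."""
    dt = 1.0 / steps
    for _ in range(steps):
        x = x + dt * (-x)
    return x

def distill_student(teacher_steps=100, student_steps=10, n_samples=256, seed=0):
    """Fit one student step x -> a * x (least squares) to match a chunk of teacher steps."""
    rng = random.Random(seed)
    chunk = teacher_steps // student_steps   # teacher steps covered by one student step
    dt = 1.0 / teacher_steps
    num = den = 0.0
    for _ in range(n_samples):
        x_in = rng.uniform(-1.0, 1.0)
        x_out = x_in
        for _ in range(chunk):               # run the teacher for `chunk` small steps
            x_out = x_out + dt * (-x_out)
        num += x_in * x_out
        den += x_in * x_in
    return num / den                         # closed-form least-squares slope

def student_sample(x, a, steps=10):
    """Run the distilled student: `steps` applications of the learned update."""
    for _ in range(steps):
        x = a * x
    return x

a = distill_student()
teacher_out = euler_sample(1.0, steps=100)   # slow, accurate teacher
naive_out = euler_sample(1.0, steps=10)      # same sampler, just fewer steps
student_out = student_sample(1.0, a)         # 10 steps, trained on teacher outputs
print(f"teacher={teacher_out:.4f} naive={naive_out:.4f} student={student_out:.4f}")
```

Simply cutting the step count changes the result, while the student trained on the teacher's outputs matches the teacher in a tenth of the steps. In a real system the student is a neural network and the match is only approximate.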

The evolving role of language models and the 'black pill' for media researchers

Ethan He posits that the primary 'black pill' for generative media researchers is the realization that much of the intelligence perceived in advanced video or image generation comes from the underlying language models, not the diffusion models themselves. This suggests a potential bottleneck in the visual component's reasoning ability, which is augmented by sophisticated LLMs. He notes that while video models are literal interpreters of instructions, powerful prompt rewriting LLMs can transform simple user requests into detailed, actionable descriptions, leading to significantly better visual outputs. This emphasis on language intelligence over diffusion architecture implies that future breakthroughs in multimodal AI might depend more heavily on advancements in LLM reasoning, context management, and agentic capabilities, prompting a strategic shift in research focus.

Common Questions

xAI built the first version of their multimodal model, Grok Imagine 0.9, in just three months with a small team. This was possible due to strong talent, efficient infrastructure, and fast iteration cycles.

Topics

AI Agents, AI & Machine Learning, Technology & Innovation, Programming & Software, Large Language Models, Model Training, Diffusion Models, World Models, Multimodal AI, Inference Optimization, Video Generation

Mentioned in this video

Software & Apps

ChatGPT

Mentioned in comparison to Grok's voice mode for real-time interaction capabilities.

GPT Image

Mentioned as an auto-regressive language model with a diffusion head, distinguishing its architecture from prompt rewriter-based image generation.

Neuro OS

A project that simulates an entire operating system using a video model, allowing users to interact with imagined interfaces like playing Doom or using Firefox.

Sora

A video generation model whose audio matchup with video content is criticized for lacking realism, indicating a current imperfection in AI-generated media.

Claude Code

An AI coding tool, mentioned in the context of prompt pruning and the evolution from AI-assisted coding to fully automated solutions.

Photoshop

A traditional image editing tool mentioned as something video agents could leverage in combination with generative AI for production-grade content.

Cosmos

A giant video foundation model built at NVIDIA, aiming to simulate the world for robotics, which Ethan He helped develop and realized had scaling laws similar to language models.

Grok Imagine 0.9

The first multimodal model released by xAI, combining audio and video generation, developed by a small team in three months.

Gemini

Google's AI model, mentioned in comparison to Grok's voice mode and as an Omni model with a diffusion head.

Grok Voice

xAI's voice mode functionality, praised for its interruption handling and real-time interaction, especially in a Tesla context.

SynthID

A watermarking technology from Google for detecting AI-generated content; the episode notes the limitation that it can be reverse-engineered.

FFmpeg

A traditional video editing tool that video agents might use for stitching clips together, rather than relying solely on generative models.

GitHub Copilot

An AI-assisted coding tool mentioned as an example of how AI assistance can gradually evolve into full automation, similar to the trajectory of video agents.

Grok search

xAI's search capability, used by the host to find Ethan He's LinkedIn post about 'reference to video'.

Megatron-LM

An open-source framework developed at NVIDIA, which Ethan He worked on, focused on training large models efficiently at scale (100 billion to trillions of parameters).

People

Elon Musk

CEO of xAI, known for his 'first principle thinking' approach and hands-on involvement with his teams.

Ian Goodfellow

Credited with generative adversarial networks (GANs) and with work on adversarial examples, which show how small input changes can drastically alter a model's perception; cited in the context of detecting AI-generated media.

Concepts

ResNet

A deep residual network architecture, whose authors (Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun) Ethan He worked with 10 years prior on computer vision research.

MP4

A video container format, typically paired with codecs that exploit temporal redundancy for compression; the compressed representation is hard for models to consume directly.

Vision Transformer

A model architecture employing patch-based processing for images, allowing Transformer networks to be applied to visual data.

Flipbook

A frontier application demonstrating real-time generative UI in a web browser-like environment, where interfaces are imagined and generated by AI models.

Large Language Model

Discussed as the driving force behind many advancements in visual intelligence, particularly through prompt rewriting and agentic capabilities.

Companies

xAI

The AI company where Ethan He most recently worked, known for building Grok and its multimodal models.

NVIDIA

Ethan He's previous employer, where he worked on Cosmos world models and large-scale GPU training frameworks.

OpenClaw

A platform or project mentioned in the context of self-modifying harnesses and time-aware models for future language models.

Products

Raspberry Pi

Mentioned alongside OpenClaw as systems that are exploring self-modifying harnesses.
