
After LLMs: Spatial Intelligence and World Models — Fei-Fei Li & Justin Johnson, World Labs

Latent Space Podcast
People & Blogs | 5 min read | 61 min video
Nov 25, 2025 | 12,040 views
TL;DR

World Labs launches Marble, a 3D generative world model, pushing towards spatial intelligence beyond LLMs.

Key Insights

1. World Labs, co-founded by Fei-Fei Li and Justin Johnson, is developing "world models" focused on spatial intelligence, extending beyond current large language models (LLMs).
2. Its product, Marble, is a generative 3D world model that creates editable environments from text and image inputs, with applications in gaming, film, and simulation.
3. Progress in AI has been driven largely by scaling compute; models can now be trained at a scale that was impossible a decade ago.
4. Academia's role is shifting toward novel ideas and fundamental research, while industry focuses on productization and rapid model development, which points to a need for better resourcing of academic AI.
5. Spatial intelligence is distinct from linguistic intelligence: it concerns understanding, reasoning, and interacting within 3D space, which is crucial for tasks beyond pure language processing.
6. Current generative models excel at pattern fitting, but true causal understanding of physics and dynamics remains a challenge, though it may emerge at scale or through specialized training.

THE EMERGENCE OF WORLD LABS AND MARBLE

World Labs, co-founded by AI pioneers Fei-Fei Li and Justin Johnson, is developing 'world models' aimed at spatial intelligence, which it sees as a significant step beyond current large language models (LLMs). Its flagship product, Marble, is a generative model that creates interactive 3D environments from inputs such as text and images. Marble targets immediate use cases in gaming, visual effects, and film, while also laying the groundwork for more sophisticated world models. It reflects the company's goal of building AI that understands and interacts with the physical world.

THE JOURNEY FROM ALEXNET TO SPATIAL INTELLIGENCE

The founders' collaboration stems from their shared academic roots at Stanford. Justin Johnson, a former student of Fei-Fei Li, noted that the start of his PhD coincided with the AlexNet breakthrough in 2012, which ushered in an era defined by scaling compute and the shift to GPUs. This sparked an interest in moving AI beyond data centers into real-world applications, particularly in 3D vision and generative modeling. Their reunion years later at World Labs was driven by parallel explorations into the limitations of LLMs and a shared conviction that spatial intelligence and world models represent the next frontier in AI research.

THE EVOLUTION OF AI RESEARCH AND ECOSYSTEM DYNAMICS

The field of deep learning has been characterized by massive increases in available compute, enabling the training of models orders of magnitude larger than those from the AlexNet era. While open challenges and academic research remain vital for progress, the ecosystem now includes significant commercial pressure and industry-driven development. The founders acknowledge concerns about imbalanced resourcing for academia but emphasize that the diversity of approaches – from open-source initiatives to proprietary product development – is healthy. Academia's role is evolving towards exploring novel, 'wacky' ideas and theoretical underpinnings, rather than solely focusing on training the largest models.

FUNDAMENTAL CHALLENGES IN WORLD MODELING

A key challenge in building robust world models is giving them true causal understanding, particularly of physics and spatial dynamics. Current models can generate plausible-looking scenes but may not deeply comprehend the underlying physical laws. This gap between pattern recognition and genuine understanding matters most for critical applications such as engineering or architecture. The open question is whether explicit physics engines should be integrated, or whether models can learn these principles implicitly from massive scale and diverse, interactive data, moving beyond mere pattern fitting.
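The gap between pattern fitting and physical understanding can be made concrete with a deliberately tiny toy. The sketch below is not how World Labs or any world model works; it assumes a ball thrown straight up, lets a nearest-neighbor lookup over observed (time, height) pairs stand in for "pattern fitting," and lets a hand-written time-stepping loop stand in for a physics engine. All function names are hypothetical.

```python
G = 9.81  # gravitational acceleration in m/s^2


def true_height(v0: float, t: float) -> float:
    """Closed-form height of a ball thrown straight up at speed v0 (no drag)."""
    return v0 * t - 0.5 * G * t * t


def physics_step_height(v0: float, t: float, dt: float = 1e-3) -> float:
    """'Physics engine' stand-in: step the equations of motion forward in time.

    Uses semi-implicit Euler, so the result differs from true_height by O(dt).
    """
    y, v = 0.0, v0
    for _ in range(round(t / dt)):
        v -= G * dt
        y += v * dt
    return y


def fit_nearest_neighbor(samples):
    """'Pattern fitting' stand-in: memorize (t, y) pairs, answer with the closest one."""
    def predict(t: float) -> float:
        _, y = min(samples, key=lambda s: abs(s[0] - t))
        return y
    return predict


v0 = 10.0
# Observations only cover the first second of flight.
train = [(i / 10, true_height(v0, i / 10)) for i in range(11)]
pattern_model = fit_nearest_neighbor(train)

for t in (0.5, 2.0):  # inside vs. outside the observed range
    print(f"t={t}: truth={true_height(v0, t):.3f}  "
          f"pattern={pattern_model(t):.3f}  physics={physics_step_height(v0, t):.3f}")
```

Inside the observed first second, both routes track the true height closely. At t=2.0 the lookup can only repeat the height it saw at t=1.0, while the stepped equations of motion still land near the true value. Real generative models interpolate far better than a lookup table, so this only caricatures the extrapolation problem.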

MARBLE: A PRODUCT AND A GLIMPSE INTO THE FUTURE

Marble is positioned as both a practical product and a foundational step towards World Labs' grand vision of spatial intelligence. It offers multimodal input capabilities (text, images), precise camera control, and interactive scene editing, making it immediately useful for creative industries. The model natively outputs Gaussian splats, enabling real-time rendering on various devices, which is crucial for its interactive features. While current versions focus on plausible visual outputs, future iterations aim to incorporate more sophisticated physics, dynamics, and deeper understanding of spatial relationships.
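The source does not describe Marble's renderer. As a hedged illustration of why splats suit real-time rendering, the sketch below shows the front-to-back alpha compositing used in 3D Gaussian splatting generally, assuming each splat has already been projected to a 2D screen-space Gaussian with a depth value. `Splat2D`, `gaussian_weight`, and `composite_pixel` are hypothetical names, and the alpha clamp and early-exit threshold are illustrative choices.

```python
import math
from dataclasses import dataclass


@dataclass
class Splat2D:
    """A Gaussian already projected to screen space (hypothetical container)."""
    mean: tuple[float, float]          # center in pixel coordinates
    cov: tuple[float, float, float]    # symmetric 2x2 covariance [[a, b], [b, c]] as (a, b, c)
    color: tuple[float, float, float]  # RGB in [0, 1]
    opacity: float                     # peak opacity in [0, 1]
    depth: float                       # camera-space depth used for sorting


def gaussian_weight(s: Splat2D, x: float, y: float) -> float:
    """exp(-0.5 * d^T Sigma^-1 d) for the offset d from the splat center to (x, y)."""
    a, b, c = s.cov
    det = a * c - b * b
    dx, dy = x - s.mean[0], y - s.mean[1]
    # Closed-form inverse of a 2x2 symmetric matrix: (1/det) * [[c, -b], [-b, a]].
    mahalanobis_sq = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * mahalanobis_sq)


def composite_pixel(splats, x, y, background=(0.0, 0.0, 0.0)):
    """Front-to-back alpha compositing: C = sum_i c_i * a_i * prod_{j<i} (1 - a_j)."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light still passing through earlier splats
    for s in sorted(splats, key=lambda sp: sp.depth):  # nearest splat first
        # Clamp so a single splat never becomes perfectly opaque.
        alpha = min(0.99, s.opacity * gaussian_weight(s, x, y))
        for k in range(3):
            color[k] += transmittance * alpha * s.color[k]
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:  # remaining splats are effectively hidden
            break
    # Whatever light is left over shows the background.
    return tuple(color[k] + transmittance * background[k] for k in range(3))


red = Splat2D(mean=(0.0, 0.0), cov=(1.0, 0.0, 1.0), color=(1.0, 0.0, 0.0), opacity=0.5, depth=1.0)
blue = Splat2D(mean=(0.0, 0.0), cov=(1.0, 0.0, 1.0), color=(0.0, 0.0, 1.0), opacity=0.5, depth=2.0)
print(composite_pixel([blue, red], 0.0, 0.0))  # (0.5, 0.0, 0.25): red in front, blue behind
```

In standard Gaussian splatting, each pixel is a short, depth-ordered sum with no neural network evaluated at render time, which is what lets it map well onto GPU rasterization.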

SPATIAL INTELLIGENCE VERSUS LINGUISTIC INTELLIGENCE

Spatial intelligence is defined as the capability to reason, understand, move, and interact within space, and is seen as complementary to linguistic intelligence. Unlike LLMs, which primarily process sequences of tokens, spatial intelligence deals with the inherent structure and multi-dimensional nature of the physical world. Human intelligence is multi-faceted, with linguistic, logical, spatial, and emotional components. Grasping a mug and deducing the structure of DNA both rely heavily on spatial reasoning, a capability that is difficult to capture fully through language alone, which underscores the need for AI systems that excel in this domain.
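One way to see why token sequences are an awkward fit for 3D structure: flattening a voxel grid into a 1D sequence preserves adjacency along only one axis. The sketch assumes an n x n x n grid flattened in raster order (x varies fastest); the function names are hypothetical.

```python
def raster_index(x: int, y: int, z: int, n: int) -> int:
    """Sequence position of voxel (x, y, z) in an n x n x n grid flattened x-fastest."""
    return x + n * y + n * n * z


def sequence_gap(a, b, n):
    """How far apart two voxels land once the grid is read as a 1D token sequence."""
    return abs(raster_index(*a, n) - raster_index(*b, n))


n = 64
origin = (10, 10, 10)
neighbors = {"x": (11, 10, 10), "y": (10, 11, 10), "z": (10, 10, 11)}
for axis, voxel in neighbors.items():
    # Each pair is adjacent in 3D space.
    print(axis, sequence_gap(origin, voxel, n))
# prints: x 1, y 64, z 4096 (one per line)
```

Cells that touch in space end up 1, n, or n² positions apart in the sequence, so a purely sequential model has to learn 3D neighborhoods indirectly.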

THE INTERPLAY OF MODALITIES AND FUTURE ARCHITECTURES

The future of AI likely involves multimodal models that integrate several forms of intelligence, including spatial and linguistic. While LLMs have demonstrated remarkable capabilities, they may struggle with tasks that require deep spatial understanding. Conversely, models focused solely on spatial data might miss nuances that language can convey. Marble itself accepts language inputs, suggesting a path toward integrated systems. Looking ahead, architectures that go beyond simple sequence-to-sequence modeling, possibly by exploiting the fact that transformers fundamentally operate on sets rather than sequences, will be crucial for building comprehensive world models that reason effectively across modalities and levels of abstraction.
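The observation that transformers behave like set models rather than sequence models can be checked directly: without positional encodings, self-attention is permutation-equivariant, so reordering the input tokens only reorders the outputs. The sketch assumes a single attention head with random weights and no positional encoding, written in pure Python to stay self-contained; `self_attention` and its helpers are hypothetical names, not any library's API.

```python
import math
import random

Matrix = list[list[float]]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    """Plain matrix product A @ B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row: list[float]) -> list[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X: Matrix, Wq: Matrix, Wk: Matrix, Wv: Matrix) -> Matrix:
    """Single-head scaled dot-product self-attention with no positional encoding."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d_k = len(Wk[0])
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d_k) for kj in K] for qi in Q]
    weights = [softmax(r) for r in scores]
    return matmul(weights, V)


def rand_matrix(rows: int, cols: int) -> Matrix:
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
X = rand_matrix(4, 3)  # 4 tokens with 3 features each
Wq, Wk, Wv = rand_matrix(3, 3), rand_matrix(3, 3), rand_matrix(3, 3)
perm = [2, 0, 3, 1]

out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention([X[i] for i in perm], Wq, Wk, Wv)

# Shuffling the input tokens shuffles the output rows in exactly the same way.
same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for r, i in enumerate(perm)
    for a, b in zip(out_perm[r], out[i])
)
print(same)  # True
```

Order enters a transformer only through whatever positional information is added to the tokens, which leaves room to encode 2D or 3D structure instead of a 1D sequence.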

APPLICATIONS AND POTENTIAL ACROSS INDUSTRIES

The potential applications of spatial intelligence and generative world models are vast, extending far beyond creative industries. Marble, for instance, is being explored for robotic training, offering a crucial source of synthetic data to overcome the 'data starvation' problem in embodied AI. Additionally, its capabilities are well-suited for architectural design, interior remodeling, and detailed simulation environments. The technology’s horizontal nature allows for emergent use cases, highlighting its adaptability and broad utility across diverse sectors and complex problem domains.

Common Questions

What is Marble?

Marble is a generative model for 3D worlds developed by World Labs. It takes inputs such as text or multiple images and generates a matching 3D world. The result can also be edited interactively, allowing users to change elements within the generated scene.
