How did Fei-Fei Li and Justin Johnson found World Labs?

World Labs was founded by Fei-Fei Li and Justin Johnson, who is Fei-Fei's former PhD student. Each independently concluded that there was untapped potential beyond language models, and they decided to focus on building world models and spatial intelligence.

What has enabled the recent advancements in world models?

Advancements in world models are driven by increased availability of data and compute power. The ability to train models on thousands of GPUs and process vast amounts of visual and spatial data has been crucial.

What is the difference between open science in academia and industry research?

While academia often prioritizes open datasets and benchmarks (like the 'Behavior' dataset), industry may keep certain advancements proprietary for business models. Both approaches contribute to the AI ecosystem, offering different types of progress.

How has the role of academia in AI changed?

Academia's role has shifted from training state-of-the-art models to exploring 'wacky' and novel ideas. With massive compute centralized in industry, academics are better positioned to focus on theoretical underpinnings, new algorithms, and foundational research.

What are some novel ideas for future AI hardware and architectures?

Instead of relying solely on matrix multiplication, the primitive that GPUs are optimized for, future systems might explore different primitives better suited to large-scale distributed computing. This could lead to drastically different neural network architectures adapted for next-generation hardware.

What was the significance of the early image captioning work?

The pioneering work combined convolutional neural networks for image representation with LSTMs for language generation, allowing models to describe images in a sentence. This research was developed independently and simultaneously with Google's efforts.

Why is spatial intelligence considered different from language-based AI?

Spatial intelligence involves understanding, reasoning, and interacting in 3D space, which is fundamentally different from the 1D, token-based processing of LLMs. It requires modeling spatial structures and dynamics, crucial for tasks beyond pure language comprehension.
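One way to see the mismatch, as a minimal sketch rather than anything World Labs has described: when a 3D grid is serialized into a 1D sequence, cells that touch in space can land far apart in the sequence. The grid size, the row-major layout, and the function names below are assumptions chosen for illustration.

```python
def flatten_index(x, y, z, size):
    """Row-major index of voxel (x, y, z) in a size^3 grid flattened to 1D."""
    return x + size * (y + size * z)


def sequence_gap(a, b, size):
    """Distance between two voxels once the grid is read as a 1D sequence."""
    return abs(flatten_index(*a, size) - flatten_index(*b, size))


SIZE = 16  # hypothetical grid resolution
center = (5, 5, 5)

# All three neighbors are exactly one step away from `center` in 3D space.
gap_x = sequence_gap(center, (6, 5, 5), SIZE)
gap_y = sequence_gap(center, (5, 6, 5), SIZE)
gap_z = sequence_gap(center, (5, 5, 6), SIZE)

print(gap_x, gap_y, gap_z)
```

The three neighbors are equally close in space, yet their gaps in the flattened sequence grow with the grid size along two of the three axes. A purely sequential model has to recover that adjacency from data or from an added positional signal.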

Can current AI models truly understand physics or causality?

While models can learn to fit patterns and make predictions (like planetary orbits), they may not grasp the underlying causal laws (like gravity). True understanding and generalization beyond training data remain challenges, especially for real-world applications requiring physical accuracy.
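The planetary-orbit example can be made concrete with a small analogy. In the sketch below, a straight line fitted to the inner planets stands in for a pattern fitter, and the relation T = a^(3/2) (period in years, semi-major axis in AU, which follows from Newtonian gravity for bodies orbiting the Sun) stands in for the underlying law. The choice of a linear fit is an assumption made for brevity; real models are far more flexible, so this illustrates only the extrapolation gap, not how any particular model behaves.

```python
# Semi-major axis (AU) and orbital period (years) for Mercury, Venus, Earth, Mars.
TRAIN = [(0.387, 0.241), (0.723, 0.615), (1.000, 1.000), (1.524, 1.881)]
NEPTUNE = (30.07, 164.8)  # held out, far outside the training range


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def kepler_period(a_au):
    """Period in years implied by gravity for a body orbiting the Sun."""
    return a_au ** 1.5


slope, intercept = fit_line(TRAIN)
a, true_period = NEPTUNE

# Relative error of each predictor on the held-out planet.
line_error = abs(slope * a + intercept - true_period) / true_period
law_error = abs(kepler_period(a) - true_period) / true_period

print(f"line fit error: {line_error:.0%}, physical law error: {law_error:.2%}")
```

The line describes the inner planets reasonably well but misses badly on Neptune, while the law-based relation stays accurate far outside the data it was never fitted to.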

What are the atomic units of generation in Marble?

Currently, Marble natively outputs Gaussian splats, which are individual particles rendered efficiently in real-time. Future iterations might use different atomic units like frames or tokens representing chunks of 3D space.
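To make the idea of a splat as an atomic unit more concrete, here is a minimal sketch of how a set of Gaussian particles can be blended into one pixel using front-to-back alpha compositing. This is not Marble's format or renderer. The `Splat` fields, the isotropic scale, and the orthographic view along the z axis are simplifying assumptions; production splat renderers use anisotropic 3D covariances and a perspective camera.

```python
import math
from dataclasses import dataclass


@dataclass
class Splat:
    center: tuple   # (x, y, z) position; larger z is farther from the viewer
    scale: float    # isotropic standard deviation (an assumption for brevity)
    color: tuple    # (r, g, b) in [0, 1]
    opacity: float  # peak opacity in [0, 1]


def composite(splats, pixel_xy):
    """Blend splats at one pixel, nearest first, under an orthographic view."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light still passing through
    for s in sorted(splats, key=lambda s: s.center[2]):
        dx = pixel_xy[0] - s.center[0]
        dy = pixel_xy[1] - s.center[1]
        # Gaussian falloff of opacity with distance from the splat center.
        alpha = s.opacity * math.exp(-(dx * dx + dy * dy) / (2 * s.scale ** 2))
        for c in range(3):
            color[c] += transmittance * alpha * s.color[c]
        transmittance *= 1.0 - alpha
    return tuple(color)


red_near = Splat(center=(0.0, 0.0, 1.0), scale=1.0, color=(1.0, 0.0, 0.0), opacity=0.5)
blue_far = Splat(center=(0.0, 0.0, 2.0), scale=1.0, color=(0.0, 0.0, 1.0), opacity=1.0)

pixel = composite([blue_far, red_near], pixel_xy=(0.0, 0.0))
print(pixel)
```

Because each particle contributes independently and the blend is a simple sorted accumulation, this kind of representation lends itself to fast rendering, which is consistent with the real-time claim above.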

How can spatial intelligence be applied in robotics?

Marble can generate high-fidelity synthetic worlds for robotics training, addressing the 'data starvation' problem. This allows embodied agents to interact with diverse simulated environments, improving their ability to learn and adapt.

What is the difference between human and AI intelligence regarding spatial reasoning?

Human spatial intelligence stems from embodied experience and evolved perceptual systems, allowing for intuitive interaction with the 3D world. AI, particularly LLMs, primarily processes abstract patterns and language, lacking the direct embodied understanding and self-awareness of human cognition.

Is sequence-to-sequence modeling outdated for world models?

While sequence-to-sequence has been influential, world models may evolve beyond it. Transformers, however, are not strictly sequence models but rather models of sets, with order often injected via positional embeddings. Attention mechanisms remain valuable.
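The set-versus-sequence point can be checked directly. The sketch below is a toy single-head self-attention with identity projections (an assumption made to keep it short, not a full transformer layer). It shows that shuffling the inputs only shuffles the outputs, and that order starts to matter once a positional signal is added. The sinusoidal signal in `add_positions` is a stand-in, not any specific published encoding.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention; Q, K, V projections are the identity."""
    d = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in tokens]
        weights = softmax(scores)
        outputs.append([sum(w * value[j] for w, value in zip(weights, tokens))
                        for j in range(d)])
    return outputs


def add_positions(tokens):
    """Add a toy sinusoidal signal that depends on each token's position."""
    return [[x + math.sin((pos + 1) * (j + 1)) for j, x in enumerate(tok)]
            for pos, tok in enumerate(tokens)]


def same_rows(a, b):
    return all(math.isclose(x, y, abs_tol=1e-9)
               for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]
shuffled = [tokens[i] for i in perm]

# Without positions, shuffling the inputs only shuffles the outputs.
out = self_attention(tokens)
out_shuffled = self_attention(shuffled)
set_like = same_rows(out_shuffled, [out[i] for i in perm])

# With positions added, the same shuffle changes the outputs themselves.
out_pos = self_attention(add_positions(tokens))
out_pos_shuffled = self_attention(add_positions(shuffled))
order_sensitive = not same_rows(out_pos_shuffled, [out_pos[i] for i in perm])

print(set_like, order_sensitive)
```

The first check passing is what "models of sets" means in practice: the attention operation itself carries no notion of order, so any ordering, whether 1D for text or something richer for 3D space, has to be supplied through the inputs.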

After LLMs: Spatial Intelligence and World Models — Fei-Fei Li & Justin Johnson, World Labs

Latent Space Podcast

People & Blogs | 5 min read | 61 min video

Nov 25, 2025|12,262 views|289|12

TL;DR

World Labs launches Marble, a 3D generative world model, pushing towards spatial intelligence beyond LLMs.

Key Insights

World Labs, founded by Fei-Fei Li and Justin Johnson, is developing "world models" focused on spatial intelligence, extending beyond current large language models (LLMs).

Their product, Marble, is a generative 3D world model that creates editable environments from text/image inputs, offering tools for gaming, film, and simulation.

The advancement of AI is driven by scaling compute, with current capabilities allowing for vast model training that was impossible a decade ago.

Academia's role shifts towards exploring novel ideas and fundamental research, while industry focuses on productization and rapid model development, highlighting a need for resourcing academic AI.

Spatial intelligence is distinct from linguistic intelligence, focusing on understanding, reasoning, and interacting within 3D space, crucial for tasks beyond pure language processing.

While current generative models excel at pattern fitting, achieving true causal understanding of physics and dynamics remains a challenge, though it may emerge at scale or through specialized training.

THE EMERGENCE OF WORLD LABS AND MARBLE

World Labs, co-founded by AI pioneers Fei-Fei Li and Justin Johnson, is at the forefront of developing 'world models' aiming for spatial intelligence, a significant step beyond current large language models (LLMs). Their flagship product, Marble, is a generative model capable of creating interactive 3D environments from diverse inputs like text and images. This technology is designed for immediate use cases in gaming, visual effects, and film, while also laying the groundwork for more sophisticated future world models. Marble exemplifies their vision of building AI that understands and interacts with the physical world.

THE JOURNEY FROM ALEXNET TO SPATIAL INTELLIGENCE

The founders' collaboration stems from their shared academic roots at Stanford. Justin Johnson, a former student of Fei-Fei Li, noted that his PhD start coincided with the AlexNet breakthrough in 2012, an era defined by scaling compute and the shift to GPUs. This sparked an interest in moving AI beyond data centers into real-world applications, particularly in 3D vision and generative modeling. Their reunion years later at World Labs was driven by parallel explorations into the limitations of LLMs and a shared conviction that spatial intelligence and world models represent the next frontier in AI research.

THE EVOLUTION OF AI RESEARCH AND ECOSYSTEM DYNAMICS

The field of deep learning has been characterized by massive increases in available compute, enabling the training of models orders of magnitude larger than those from the AlexNet era. While open challenges and academic research remain vital for progress, the ecosystem now includes significant commercial pressure and industry-driven development. The founders acknowledge concerns about imbalanced resourcing for academia but emphasize that the diversity of approaches – from open-source initiatives to proprietary product development – is healthy. Academia's role is evolving towards exploring novel, 'wacky' ideas and theoretical underpinnings, rather than solely focusing on training the largest models.

FUNDAMENTAL CHALLENGES IN WORLD MODELING

A key challenge in building robust world models lies in imbuing them with true causal understanding, particularly of physics and spatial dynamics. While current models can generate plausible-looking scenes, they may not deeply comprehend underlying physical laws. This gap highlights the difference between pattern recognition and genuine understanding, especially for critical applications like engineering or architecture. The debate centers on whether physics engines should be integrated or if models can learn these principles implicitly through massive scale and diverse, interactive data, moving beyond mere pattern fitting.

MARBLE: A PRODUCT AND A GLIMPSE INTO THE FUTURE

Marble is positioned as both a practical product and a foundational step towards World Labs' grand vision of spatial intelligence. It offers multimodal input capabilities (text, images), precise camera control, and interactive scene editing, making it immediately useful for creative industries. The model natively outputs Gaussian splats, enabling real-time rendering on various devices, which is crucial for its interactive features. While current versions focus on plausible visual outputs, future iterations aim to incorporate more sophisticated physics, dynamics, and deeper understanding of spatial relationships.

SPATIAL INTELLIGENCE VERSUS LINGUISTIC INTELLIGENCE

Spatial intelligence is defined as the capability to reason, understand, move, and interact within space, seen as complementary to linguistic intelligence. Unlike LLMs that primarily process sequential tokens, spatial intelligence deals with the inherent structure and multi-dimensional nature of the physical world. Human intelligence is multi-faceted, including linguistic, logical, spatial, and emotional components. The ability to grasp a mug or deduce DNA structure relies heavily on spatial reasoning, a capability that is difficult to fully capture through language alone, underscoring the need for AI systems that excel in this domain.

THE INTERPLAY OF MODALITIES AND FUTURE ARCHITECTURES

The future of AI likely involves multimodal models that seamlessly integrate various forms of intelligence, including spatial and linguistic. While LLMs have demonstrated remarkable capabilities, they may struggle with tasks requiring deep spatial understanding. Conversely, models focused solely on spatial data might miss nuances that language can convey. Marble itself accepts language inputs, suggesting a path towards integrated systems. Looking ahead, architectures beyond simple sequence-to-sequence modeling, possibly leveraging transformers' set-based processing, will be crucial for building truly comprehensive world models that can reason effectively across different modalities and levels of abstraction.

APPLICATIONS AND POTENTIAL ACROSS INDUSTRIES

The potential applications of spatial intelligence and generative world models are vast, extending far beyond creative industries. Marble, for instance, is being explored for robotic training, offering a crucial source of synthetic data to overcome the 'data starvation' problem in embodied AI. Additionally, its capabilities are well-suited for architectural design, interior remodeling, and detailed simulation environments. The technology’s horizontal nature allows for emergent use cases, highlighting its adaptability and broad utility across diverse sectors and complex problem domains.

Common Questions

What is Marble?

Marble is a generative model for 3D worlds developed by World Labs. It takes inputs like text or multiple images and generates a matching 3D world. It can also be interactively edited, allowing users to change elements within the generated scene.

Topics

Generative Models, 3D Generation

Mentioned in this video

Studies & Research

CVPR 2015

The venue where the first image captioning paper from Fei-Fei Li's lab, utilizing CNNs and LSTMs, was presented.

Behavior Dataset

An open dataset and benchmark for robotic learning in simulated environments, developed by Fei-Fei Li's Stanford lab.

ICCV 2015

A conference where Justin Johnson ran a real-time image captioning demo from a laptop connected to a server across the country.

CVPR 2016

The venue where the paper on dense captioning, extending image captioning to describe multiple regions in an image, was presented.

People

Giuseppe Longo

A researcher mentioned as a prominent proponent of world models.

Yann LeCun

Mentioned as a prominent proponent of the idea of world models.

Software & Apps

RTFM model

Another model developed by World Labs that generates frames one at a time, contrasting with the splat-based approach of Marble.

RNN

Recurrent Neural Network, an earlier architecture used for sequential data processing, which was explored in language modeling research.

Linux source code

Used as a training corpus for an RNN language model whose internal structure and neuron firing patterns were then inspected, an early example of analyzing what a neural network has learned.

LSTM

A type of recurrent neural network (RNN) used in early language modeling and combined with CNNs for image captioning tasks.
