Chelsea Finn: Building Robots That Can Do Anything

Y Combinator
Science & Technology | 3 min read | 45 min video
Jul 22, 2025|89,641 views|2,342|65

TL;DR

Robots need general-purpose intelligence, not just scale, for real-world tasks. Diverse data and pre-training are key.

Key Insights

1. Developing general-purpose robots requires a foundation model approach, similar to language models, rather than task-specific solutions.

2. While scale is necessary, diverse and realistic data is more crucial than sheer volume for robots to generalize to real-world conditions.

3. A pre-training and curated post-training recipe is vital for complex robotic tasks like folding laundry, significantly improving performance.

4. Robots can generalize to unseen environments and tasks by training on diverse data, even if that data constitutes a small percentage of the total pre-training mix.

5. Open-ended prompts and interjections can be handled by robots through hierarchical models and synthetic data generation using language models.

6. Integrating world models and improving real-time inference infrastructure are key challenges for deploying robust robotic systems.

THE CHALLENGE OF SPECIALIZED ROBOTICS

The current paradigm for robotics applications often requires building an entire company around each specific task, from logistics to surgery. Each application means developing custom hardware, software, and movement primitives, and handling numerous edge cases. This highly specialized approach is difficult, and it has historically limited the success and widespread adoption of robots in daily life. Physical Intelligence aims to overcome this by developing a general-purpose model that enables any robot to perform any task in any environment, mirroring the success of foundation models in language.

THE NECESSITY OF DIVERSE AND REALISTIC DATA

While scale is important for training generalizable models, simply scaling up data from industrial automation, YouTube, or simulation is insufficient. Industrial data lacks the diversity needed for real-world applications like disaster response or grocery bagging. YouTube data, while abundant, does not provide the embodied experience a robot needs to learn from. Simulated data often lacks realism. The lesson is that scale is necessary but not sufficient: solving the problem requires diverse, realistic, and relevant data that captures the complexity of the physical world.

PRE-TRAINING AND POST-TRAINING FOR COMPLEX TASKS

For highly complex tasks such as folding laundry, a two-stage recipe is crucial: pre-train on all available robot data, then fine-tune on a curated, high-quality set of demonstrations. This recipe, inspired by language model development, significantly improves robotic performance. Starting with simpler subtasks, such as folding a single shirt, and gradually increasing complexity, combined with pre-training and post-training, unlocks capabilities that were previously unattainable with simpler methods.
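The two-stage recipe can be sketched as a simple data plan. This is a minimal illustration, not the method used in the talk: the `curated` flag, the episode dictionaries, and the function name `two_stage_plan` are all hypothetical, and no actual model training is performed.

```python
from typing import Dict, List, Tuple

Episode = Dict[str, object]  # hypothetical record describing one robot episode


def two_stage_plan(episodes: List[Episode]) -> List[Tuple[str, List[Episode]]]:
    """Return (stage name, data) pairs for the two training stages."""
    # Stage 1: pre-train on all available robot data.
    pretrain = list(episodes)
    # Stage 2: post-train only on the curated, high-quality demonstrations.
    posttrain = [ep for ep in episodes if ep.get("curated", False)]
    return [("pretrain", pretrain), ("posttrain", posttrain)]


# Made-up episodes for illustration only.
episodes = [
    {"task": "fold shirt", "curated": True},
    {"task": "fold shirt", "curated": False},
    {"task": "bus table", "curated": False},
]

plan = two_stage_plan(episodes)
for stage, data in plan:
    print(stage, len(data))
```

In a real pipeline each stage would feed a training loop; the point here is only that the second stage sees a filtered subset of what the first stage saw.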

GENERALIZATION TO UNSEEN ENVIRONMENTS AND TASKS

A significant advancement is enabling robots to succeed in environments they have never encountered. This is achieved by training on highly diverse datasets that include mobile manipulation data from various homes, kitchens, and bedrooms, even if this data represents a small fraction of the total pre-training mix. The key is that this diverse data, along with static manipulation and web data, allows the model to build a general understanding, leading to improved performance in novel situations. Preserving the capabilities of pre-trained vision-language models is also vital for effective language following.
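One way to picture a pre-training mix in which mobile manipulation data is only a small fraction is weighted sampling over data sources. The sketch below is an assumption-laden illustration: the source names and the weights in `MIXTURE` are invented for the example and are not figures from the talk.

```python
import random
from collections import Counter
from typing import Dict, List

# Hypothetical mixture weights; the summary gives no exact proportions.
MIXTURE: Dict[str, float] = {
    "static_manipulation": 0.60,
    "web_data": 0.35,
    "mobile_manipulation_homes": 0.05,  # small but diverse slice
}


def sample_sources(mixture: Dict[str, float], n: int, seed: int = 0) -> List[str]:
    """Draw n data-source names in proportion to their mixture weights."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


counts = Counter(sample_sources(MIXTURE, 10_000))
for name in MIXTURE:
    print(name, counts[name])
```

Each training batch would then pull examples from whichever source was drawn, so the diverse home data appears throughout training even though it is a small share of the total.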

RESPONDING TO OPEN-ENDED PROMPTS AND INTERJECTIONS

To allow robots to handle user-defined tasks and dynamic interactions, hierarchical vision-language-action models are employed. A high-level policy breaks an open-ended prompt down into intermediate subtasks, which a low-level policy then executes. Synthetic data, generated by language models that re-label existing robot data with hypothetical human prompts, plays a crucial role in training the high-level policy. This approach enables robots to understand and respond to complex instructions, modifications, and real-time corrections, going beyond a fixed set of commands.
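The division of labor between the two levels can be sketched with plain functions. Everything here is a stand-in: `toy_high_level` and `toy_low_level` are hypothetical rule-based stubs in place of learned models, observations are plain strings rather than images, and the action is a dummy vector.

```python
from typing import Callable, List, Tuple

HighLevel = Callable[[str, str], str]         # (observation, prompt) -> subtask
LowLevel = Callable[[str, str], List[float]]  # (observation, subtask) -> action


def toy_high_level(observation: str, prompt: str) -> str:
    """Stand-in for the high-level policy that picks the next subtask."""
    if "no pickles" in prompt and "pickles" in observation:
        return "remove pickles"
    if "sandwich" in prompt:
        return "pick up bread"
    return "wait"


def toy_low_level(observation: str, subtask: str) -> List[float]:
    """Stand-in for the low-level policy; returns a dummy action vector."""
    return [float(len(subtask)), 0.0]


def step(high: HighLevel, low: LowLevel, observation: str, prompt: str) -> Tuple[str, List[float]]:
    """One control step: choose a subtask, then produce an action for it."""
    subtask = high(observation, prompt)
    return subtask, low(observation, subtask)


# Initial open-ended prompt.
first = step(toy_high_level, toy_low_level, "table with bread", "make me a sandwich")
# The user interjects mid-task; the high-level policy re-plans on the next step.
second = step(toy_high_level, toy_low_level, "sandwich with pickles",
              "make me a sandwich, no pickles")
print(first[0])
print(second[0])
```

Because the subtask is recomputed at every step from the latest prompt and observation, an interjection changes behavior without restarting the episode.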

FUTURE CHALLENGES AND OPPORTUNITIES

Despite significant progress, challenges remain, including improving reliability and speed and handling partial observability and long-term planning. Key opportunities lie in developing better robotic infrastructure, contributing to open-source models and datasets, and exploring synthetic data for evaluation and reinforcement learning. Integrating world models, ensuring safety, and scaling real-time inference are critical next steps toward truly deployable, general-purpose robots in the open world.

Comparison of Pre-training and Post-training Strategies for Robot Task Performance

Data extracted from this episode

Strategy | Performance (Task Progress) | Evaluation Context
Pre-training and Post-training (Combined) | High Performance (Reliably flatten and fold objects) | Evaluated robot task performance
No Pre-training (Only Curated Data) | Minimal Progress (Only able to get item out of bin) | Evaluated robot task performance
No Post-training (All Data) | Minimal Progress (Only able to get item out of bin) | Evaluated robot task performance
Full Pre-training Mixture | Higher Performance (>20% increase) | Evaluated in novel homes
Excluding Static Robot Data | Significantly Reduced Performance (<60%) | Evaluated in novel homes
Increased Diversity of Homes | Performance Increases (Closes generalization gap) | Evaluated in novel homes

Common Questions

The primary challenge with the current robotics paradigm is that solving a specific application often requires building an entire company around it. This involves creating new hardware, custom software, and unique movement primitives for each task, which is difficult and has led to limited success for many robotics companies.
