Why is simply scaling up robot data not enough for general-purpose robots?

While scale is necessary, data from industrial automation, YouTube, or simulation lacks the diversity of behaviors needed for open-world tasks. Industrial data is repetitive, YouTube video is passive observation without embodiment, and simulated data often lacks realism.

How did the laundry folding robot overcome initial failures?

The robot initially struggled with wrinkled clothes and varied item positions. A breakthrough came by pre-training on all available robot data and then fine-tuning on a high-quality, curated set of demonstration data, inspired by language modeling techniques.

How can robots succeed in environments they haven't been trained in?

Collecting diverse data from various environments (e.g., different rooms, kitchens, bedrooms) is key. By incorporating this diverse mobile manipulation data into a larger pre-training mixture, robots can generalize better to novel situations, even if that specific data is a small percentage of the total training set.

Can robots understand and respond to natural human language prompts and commands?

Yes, by using hierarchical vision-language-action models and synthetic data. Language models can relabel existing robot data with hypothetical human prompts, so robots learn to follow a wider range of open-ended instructions and interjections.

What are the key infrastructure layers needed for deploying robots in the real world?

Essential infrastructure includes real-time systems for high-frequency action execution, fast inference capabilities on the robot itself, and robust large-scale machine learning infrastructure for training and ingesting multimodal data.

What are the main differences between robotics research in academia and industry?

Academic labs typically have fewer resources for data collection and compute than startups and industry labs, but they are well suited to problems that do not depend on those resources. Industry offers greater throughput but can sometimes use resources more wastefully. Each setting has its own pros and cons.

Key Moments

Chelsea Finn: Building Robots That Can Do Anything

Q: What is the main challenge in developing general-purpose robots?

The primary challenge is that solving a specific robotics application often requires building an entire company around it. This involves creating new hardware, custom software, and unique movement primitives for each task, which is difficult and has led to limited success for many robotics companies.

Y Combinator

Science & Technology | 3 min read | 45 min video

Jul 22, 2025|96,539 views|2,499|67

TL;DR

Robots need general-purpose intelligence, not just scale, for real-world tasks. Diverse data and pre-training are key.

Key Insights

Developing general-purpose robots requires a foundation model approach, similar to language models, rather than task-specific solutions.

While scale is necessary, diverse and realistic data is more crucial than sheer volume for robots to generalize to real-world conditions.

A pre-training and curated post-training recipe is vital for complex robotic tasks like folding laundry, significantly improving performance.

Robots can generalize to unseen environments and tasks by training on diverse data, even if that data constitutes a small percentage of the total pre-training mix.

Open-ended prompts and interjections can be handled by robots through hierarchical models and synthetic data generation using language models.

Integrating world models and improving real-time inference infrastructure are key challenges for deploying robust robotic systems.

THE CHALLENGE OF SPECIALIZED ROBOTICS

The current paradigm for robotics applications often requires building an entire company for each specific task, from logistics to surgery. This involves developing custom hardware, software, movement primitives, and handling numerous edge cases. This highly specialized approach is difficult and has historically limited the success and widespread adoption of robots in daily life. Physical Intelligence aims to overcome this by developing a general-purpose model capable of enabling any robot to perform any task in any environment, mirroring the success of foundation models in language.

THE NECESSITY OF DIVERSE AND REALISTIC DATA

While scale is important for training generalizable models, simply scaling up data from industrial automation, YouTube, or simulation is insufficient. Industrial data lacks the diversity needed for real-world applications like disaster response or grocery bagging. YouTube data, while abundant, does not provide embodied experience. Simulated data often lacks realism. The lesson is that scale is necessary but not sufficient: solving the problem requires diverse, realistic, and relevant data that captures the complexity of the physical world.

PRE-TRAINING AND POST-TRAINING FOR COMPLEX TASKS

For highly complex tasks such as folding laundry, a two-stage recipe is crucial: pre-train on all available robot data, then fine-tune on a curated, high-quality set of demonstrations. This recipe, inspired by language model development, significantly improves robotic performance. The team started with simpler subtasks, like folding a single shirt, and gradually increased complexity. Combined with pre-training and post-training, this unlocked capabilities that simpler methods had not reached.
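The ordering of the two stages can be sketched with a toy model. This is a minimal illustration only, assuming a one-dimensional linear model and made-up data. The names `train`, `all_data`, and `curated` are hypothetical and nothing here comes from the talk. It shows how fine-tuning starts from the pre-trained parameters rather than from scratch. It does not reproduce the ablation results reported for laundry folding.

```python
def mse(params, data):
    """Mean squared error of the linear model y = w*x + b on (x, y) pairs."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(params, data, lr, steps):
    """Full-batch gradient descent, starting from the given parameters."""
    w, b = params
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Stage 1 data (assumption): large and mixed, a toy stand-in for
# "all available robot data" containing several inconsistent strategies.
xs = [i / 10 for i in range(-20, 21)]
all_data = [(x, slope * x) for slope in (1.5, 2.0, 2.5, 3.0) for x in xs]

# Stage 2 data (assumption): small and consistent, a toy stand-in for
# the curated demonstration set.
curated = [(x, 2.0 * x) for x in (-1.0, -0.5, 0.5, 1.0)]

# Pre-train from scratch, then fine-tune starting from the pre-trained weights.
pretrained = train((0.0, 0.0), all_data, lr=0.05, steps=200)
finetuned = train(pretrained, curated, lr=0.05, steps=200)

if __name__ == "__main__":
    print("curated-set error after pre-training:", mse(pretrained, curated))
    print("curated-set error after fine-tuning: ", mse(finetuned, curated))
```

In this toy, pre-training settles on an average of the mixed strategies, and fine-tuning then moves the same parameters toward the single consistent strategy in the curated set.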

GENERALIZATION TO UNSEEN ENVIRONMENTS AND TASKS

A significant advancement is enabling robots to succeed in environments they have never encountered. This is achieved by training on highly diverse datasets that include mobile manipulation data from various homes, kitchens, and bedrooms, even if this data represents a small fraction of the total pre-training mix. The key is that this diverse data, along with static manipulation and web data, allows the model to build a general understanding, leading to improved performance in novel situations. Preserving the capabilities of pre-trained vision-language models is also vital for effective language following.
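One way to picture a pre-training mixture in which the diverse home data is only a small share is weighted sampling over data sources. The sketch below is illustrative: the source names and the weights in `MIXTURE` are assumptions chosen for the example, not figures from the talk.

```python
import random
from collections import Counter

# Hypothetical mixture weights (assumption, not from the talk).
MIXTURE = {
    "static_manipulation": 0.60,
    "web_data": 0.37,
    "mobile_manipulation_homes": 0.03,
}


def sample_sources(mixture, n, seed=0):
    """Draw n data-source names in proportion to the mixture weights."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


if __name__ == "__main__":
    total = 10_000
    counts = Counter(sample_sources(MIXTURE, total))
    for name, count in counts.most_common():
        print(f"{name}: {count / total:.1%}")
```

Each training batch would then be assembled by pulling examples from whichever source each draw names, so the small home-data share still appears throughout training.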

RESPONDING TO OPEN-ENDED PROMPTS AND INTERJECTIONS

To allow robots to handle user-defined tasks and dynamic interactions, hierarchical vision-language-action models are employed. These models break down open-ended prompts into intermediate subtasks, executing them with a low-level policy. Synthetic data, generated by language models that re-label existing robot data with hypothetical human prompts, plays a crucial role in training the high-level policy. This approach enables robots to understand and respond to complex instructions, modifications, and real-time corrections, going beyond a fixed set of commands.
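The split between a high-level and a low-level policy, and the relabeling step, can be sketched as follows. This is a minimal sketch under stated assumptions: the lookup table `SUBTASK_PLANS`, the sandwich prompt, the "no X" interjection rule, and the template-based `template_prompt` are all hypothetical stand-ins for learned models and for a real language model.

```python
# Hypothetical lookup table standing in for a learned high-level policy.
SUBTASK_PLANS = {
    "make me a sandwich": [
        "pick up bread",
        "add cheese",
        "add lettuce",
        "close sandwich",
    ],
}


def high_level_policy(prompt, interjection=None):
    """Map an open-ended prompt to subtasks, adjusting for an interjection."""
    plan = list(SUBTASK_PLANS.get(prompt.lower(), []))
    # Toy rule (assumption): an interjection "no X" drops subtasks mentioning X.
    if interjection and interjection.lower().startswith("no "):
        banned = interjection[3:].strip().lower()
        plan = [subtask for subtask in plan if banned not in subtask]
    return plan


def low_level_policy(subtask):
    """Stand-in for the low-level policy: return a placeholder action label."""
    return f"execute<{subtask}>"


def run(prompt, interjection=None):
    """Decompose the prompt, then execute each subtask with the low-level policy."""
    return [low_level_policy(s) for s in high_level_policy(prompt, interjection)]


def template_prompt(subtasks):
    """Stand-in for a language model proposing a hypothetical user prompt."""
    return "Can you " + ", then ".join(subtasks) + "?"


def relabel_episode(subtasks, propose_prompt):
    """Pair each labeled subtask with a synthetic prompt for high-level training."""
    prompt = propose_prompt(subtasks)
    return [(prompt, subtask) for subtask in subtasks]


if __name__ == "__main__":
    print(run("Make me a sandwich", interjection="no cheese"))
    print(relabel_episode(["pick up bread", "add cheese"], template_prompt))
```

The pairs returned by `relabel_episode` play the role of the synthetic training data for the high-level policy, while the low-level policy is left unchanged.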

FUTURE CHALLENGES AND OPPORTUNITIES

Despite significant progress, challenges remain: improving reliability and speed, handling partial observability, and planning over long horizons. Key opportunities lie in developing better robotic infrastructure, contributing to open-source models and datasets, and exploring synthetic data for evaluation and reinforcement learning. Integrating world models, ensuring safety, and scaling real-time inference are critical next steps toward truly deployable, general-purpose robots in the open world.

Comparison of Pre-training and Post-training Strategies for Robot Task Performance

Data extracted from this episode

Strategy	Performance (Task Progress)	Evaluation Context
Pre-training and Post-training (Combined)	High Performance (Reliably flatten and fold objects)	Evaluated robot task performance
No Pre-training (Only Curated Data)	Minimal Progress (Only able to get item out of bin)	Evaluated robot task performance
No Post-training (All Data)	Minimal Progress (Only able to get item out of bin)	Evaluated robot task performance
Full Pre-training Mixture	Higher Performance (>20% increase)	Evaluated in novel homes
Excluding Static Robot Data	Significantly Reduced Performance (<60%)	Evaluated in novel homes
Increased Diversity of Homes	Performance Increases (Closes generalization gap)	Evaluated in novel homes

Topics

General Intelligence, Real-world Applications

Mentioned in this video

Organizations

Physical Intelligence

Software & Apps

PaliGemma

An open-source, three-billion-parameter vision-language model used as a foundation for a robot control system. It takes images and language commands as input to predict future actions.
