Robots Are Finally Starting to Work
Key Moments
Robots are now performing complex tasks like folding laundry and packaging orders with near-human capabilities, but each successful deployment still requires significant human oversight and is costly to scale.
Key Insights
Physical Intelligence's cross-embodiment approach, training across multiple robot platforms, resulted in a 50% improvement in performance compared to single-embodiment specialists.
Tasks that previously required hundreds of hours of data collection for robots can now be performed zero-shot, meaning without prior specific training for that task.
Even with advanced AI, a 'mixed autonomy system' with human oversight is currently necessary for robot deployment, with the goal of gradually increasing autonomy.
Cloud-hosted models are enabling real-time robotic control, overcoming the limitations of on-robot compute power and the rapid obsolescence of hardware.
The upfront cost of building a robotics company has significantly decreased, with the focus shifting from proprietary autonomy stacks to identifying use cases and collecting relevant data.
Physical Intelligence has open-sourced its foundation models (π0 and π0.5) to accelerate progress and foster a 'Cambrian explosion' of robotics startups.
Robots are achieving near-human capabilities, performing complex tasks with surprising efficiency.
The field of robotics is on the cusp of a significant breakthrough, moving beyond specialized industrial applications to tackle complex, everyday tasks. Companies like Physical Intelligence (PI) are developing foundation models that can control any robot to perform, with high performance, any task it is physically capable of. This progress is exemplified by demonstrations of robots folding diverse laundry items in real-world laundromats and packaging consumer goods in e-commerce warehouses. These are not simple tasks: laundry items are deformable and infinitely varied, and packaging requires precise manipulation, including nudging items into pouches through narrow openings. Such tasks were long considered a 'Turing test' for robotics, because the deterministic programming approaches of the past struggled with the infinite variability of the real world. For many, the possibility of robots performing them became conceivable only after the advances demonstrated by models like ChatGPT.
Language models are unlocking common sense and planning for robots.
A key enabler of the recent progress in robotics is the integration of large language models (LLMs) and vision-language models. Semantics, one of the three pillars considered crucial for robotics (alongside planning and real-time control), has advanced significantly thanks to LLMs. These models bring common-sense knowledge into robotics, drastically reducing the need for task-specific data collection. For instance, an LLM can break down a task like 'record a podcast' into actionable steps and plans. Models like RT-2 (Robotic Transformer 2) and PaLM-E have demonstrated the ability to transfer knowledge from vision-language models to low-level robotic actions. This allows robots to perform tasks involving concepts not explicitly present in their robot training data, such as identifying and interacting with specific celebrities in images or performing spatial reasoning over unseen objects.
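The decomposition step described above can be sketched as a thin planning layer over any chat-style LLM. This is a minimal illustration, not PI's actual pipeline: `query_llm` is a hypothetical stand-in that returns a canned reply so the sketch runs offline.

```python
# Hedged sketch: an LLM-backed planner that decomposes a high-level
# instruction into robot-executable steps. `query_llm` is a placeholder
# for any hosted chat-completion API; here it returns a canned plan.

def query_llm(prompt: str) -> str:
    # A deployed system would send `prompt` to a hosted model and
    # return its reply; this stub keeps the sketch self-contained.
    return ("1. pick up microphone\n"
            "2. place microphone on stand\n"
            "3. press record button")

def plan_task(instruction: str) -> list[str]:
    prompt = f"Break the task '{instruction}' into short, concrete robot steps."
    reply = query_llm(prompt)
    # Each numbered line becomes one step for the low-level controller.
    return [line.split(". ", 1)[1] for line in reply.splitlines()]

steps = plan_task("record a podcast")
print(steps)  # ['pick up microphone', 'place microphone on stand', 'press record button']
```

The value of the LLM here is common sense: the high-level model supplies the plan, while a separate vision-language-action policy handles each concrete step.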
Cross-embodiment training yields a 50% performance boost over specialists.
A significant challenge in robotics has been scaling data collection and ensuring models generalize across different hardware. Physical Intelligence's 'open cross-embodiment' approach trains models on data from multiple robot platforms simultaneously. This strategy leads to models that learn a more abstract understanding of robot control, rather than optimizing for a single, specific robot. Surprisingly, research from this approach, specifically the RT-X ('Robotic Transformer X') paper, showed that a generalist model trained across 10 different robot platforms performed 50% better than specialist models optimized for individual platforms. This is a paradigm shift: traditionally, every new robot platform added years to development because of the effort required to get it operational and collect data. The implication is that by leveraging data from a diverse fleet, models become more robust and adaptable to variations and changes in hardware over time.
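One practical ingredient of cross-embodiment training is mapping heterogeneous action spaces into a shared one so a single policy can consume all platforms' data. The sketch below is a simplified illustration under assumed conventions (zero-padding to a fixed width, round-robin mixing); real pipelines also normalize scales and align observation formats.

```python
# Hedged sketch of cross-embodiment data mixing: episodes from robots
# with different action dimensions are padded to a shared action space
# so one generalist policy can train on all of them. Names illustrative.

import itertools

MAX_ACTION_DIM = 7  # e.g. a 7-DoF arm sets the shared action width

def to_shared_action_space(action: list[float]) -> list[float]:
    # Zero-pad lower-DoF platforms (a mobile base, a gripper-only robot)
    # up to the shared width; a real system would also normalize scales.
    return action + [0.0] * (MAX_ACTION_DIM - len(action))

def interleave(*platform_streams):
    # Round-robin over per-platform streams so every training batch
    # mixes embodiments instead of optimizing for a single robot.
    for group in itertools.zip_longest(*platform_streams):
        for sample in group:
            if sample is not None:
                yield to_shared_action_space(sample)

arm = [[0.1] * 7, [0.2] * 7]      # 7-DoF manipulator actions
base = [[0.5, -0.5], [0.0, 1.0]]  # 2-DoF mobile-base actions

mixed = list(interleave(arm, base))
print(len(mixed), len(mixed[0]))  # 4 7
```

Training on the interleaved stream is what forces the model toward an embodiment-agnostic representation of control rather than memorizing one robot's kinematics.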
Cloud-based AI powers robots, overcoming hardware limitations.
Contrary to the expectation that robots need substantial on-board computing power, PI's models, even for complex tasks, are largely hosted in the cloud. This approach circumvents the high cost, rapid obsolescence, and power demands of on-robot compute hardware. Robots query API endpoints in the cloud, sending sensor data and receiving action commands in real time. PI has developed algorithmic improvements, such as 'real-time chunking' and pre-computation within the control loop, to manage the inherent latency of cloud inference. This allows for smooth, continuous action with as little as 50 milliseconds' worth of actions remaining before the next set is requested. This strategy simplifies robot hardware significantly, potentially enabling even basic 'dumb' computers and cameras to stream data for sophisticated AI control, a crucial factor for mass deployment.
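The buffering idea behind real-time chunking can be sketched in a few lines. This is a toy model of the loop described above, not PI's implementation: the endpoint, chunk size, and timing constants are all assumptions, and a real controller would overlap the network request with execution instead of blocking.

```python
# Hedged sketch of a cloud-driven control loop: the robot holds a chunk
# of future actions locally and requests the next chunk before the
# buffer runs dry, hiding cloud-inference latency behind execution.

CONTROL_DT = 0.02          # 50 Hz control loop -> 20 ms per action
REFRESH_THRESHOLD = 0.05   # re-query with ~50 ms of actions remaining

def query_cloud_policy(observation):
    # Stand-in for a network call to a hosted model; returns a chunk
    # of 10 future actions for a 1-D command (purely illustrative).
    return [observation * 0.1 + i for i in range(10)]

def control_loop(observations):
    buffer, executed = [], []
    for obs in observations:
        remaining = len(buffer) * CONTROL_DT
        if remaining <= REFRESH_THRESHOLD:
            # Prefetch: in a real system this request runs while the
            # actions still left in the buffer are being executed.
            buffer = query_cloud_policy(obs)
        executed.append(buffer.pop(0))
    return executed

actions = control_loop(observations=[1.0] * 12)
print(len(actions))  # 12: one action executed per control tick
```

The design point is that latency tolerance comes from the chunk, not from fast inference: as long as a new chunk arrives before the last ~50 ms of buffered actions run out, motion stays continuous.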
A new playbook emerges for vertical robotics startups.
The landscape for starting robotics companies has been dramatically reshaped. Historically, robotics required immense vertical integration, encompassing customer relationships, hardware development, autonomy stacks, and safety certifications, creating a high barrier to entry. PI aims to lower this barrier by providing a foundational intelligence layer that the community can build upon. The new playbook involves understanding existing workflows, carefully identifying opportunities where robots can provide maximum impact, and being scrappy with hardware and data collection. The reactive nature of modern AI models can compensate for inaccuracies in less expensive robot hardware. The key is to reach economic break-even through mixed autonomy systems (where humans provide oversight and corrections) before scaling robot deployment, historically a challenge due to long payback periods. This unbundling allows startups to focus on differentiation rather than reinventing the entire robotics stack.
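The break-even logic of mixed autonomy reduces to simple arithmetic: margin improves as autonomy rises and the hours of human oversight per robot fall. The numbers below are entirely hypothetical, chosen only to illustrate the crossover, and are not from the episode.

```python
# Illustrative back-of-envelope for the mixed-autonomy break-even point.
# All figures are hypothetical placeholders, not data from the episode.

def monthly_margin(robot_cost_month, human_rate, oversight_hours, revenue_month):
    # Total cost = amortized robot cost + paid human oversight time.
    cost = robot_cost_month + human_rate * oversight_hours
    return revenue_month - cost

# As autonomy improves, one remote operator supervises more robots,
# so oversight hours attributed to each robot shrink.
for oversight in (160, 80, 40):  # hours of oversight per robot-month
    margin = monthly_margin(robot_cost_month=2000, human_rate=25,
                            oversight_hours=oversight, revenue_month=5000)
    print(oversight, margin)  # 160 -> -1000, 80 -> 1000, 40 -> 2000
```

Under these placeholder figures the deployment crosses break-even somewhere between 160 and 80 oversight hours, which is the point at which scaling the fleet starts to make sense.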
The promise of a 'Cambrian explosion' in robotics.
Physical Intelligence's mission is to enable a widespread proliferation of robotics companies, akin to a 'Cambrian explosion.' By providing open-source foundation models (π0 and π0.5) and publishing their research, they aim to accelerate progress across the industry. The belief is that robots can bring abundance to the physical world, mirroring the impact of digital computing. The success of PI is not solely defined by its own models on its robots, but by the broader adoption and application of its technology by other companies and researchers. This vision addresses the argument that the fundamental problem of general-purpose robotics might take decades longer to solve, emphasizing community enablement and collaborative progress. The hope is to inspire a new wave of startups that can leverage these advancements to solve a myriad of real-world problems, especially in sectors with labor shortages.
The human element: a diverse team tackling a monumental challenge.
PI is an unconventional company with a larger than average founding team, many of whom previously worked together at Google's robotics team. This established synergy and shared experience, coupled with a passion for the mission, proved crucial for tackling the immense complexity of general-purpose robotics. Key members include Adnan, the hardware lead tasked with managing a fleet of heterogeneous robots, and Locky, who ensures the business aspects are sound. The team emphasizes a 'divide and conquer' strategy, leveraging individual strengths to accelerate progress. For many co-founders, this is their first startup experience, revealing a surprising lack of existing infrastructure for large-scale, general-purpose robotics. This has necessitated PI building much of its own software for data collection, management, annotation, and evaluation, highlighting an opportunity for future support services businesses in the robotics ecosystem.
Generalist vs. Specialist Robot Policy Performance
Data extracted from this episode
| Model Type | Relative Performance | Notes |
|---|---|---|
| Generalist (trained across 10 robot platforms) | 50% better than specialists | Outperformed specialist models optimized for a single platform. |
Common Questions
What is the 'GPT-1 moment' for robotics?
The 'GPT-1 moment' for robotics refers to a breakthrough analogous to GPT-1's impact on language models: a single general-purpose AI model that can control any robot to perform, with high performance, any task it is physically capable of.
Topics
Mentioned in this video
A transformative moment for robotics akin to the GPT-1 moment in language models, signifying a leap in general-purpose robotic capabilities.
The concept of training models across multiple robot hardware platforms, showing that generalist models perform better than specialists. This enabled scaling laws in robotics.
Used to illustrate the potential economic impact of solving advanced robotics, suggesting a 10% contribution to US GDP would be a massive number, justifying investment in data collection.
A large language model that serves as an analogy for the potential disruption in robotics, representing a significant advancement in AI.
A vision-language model adapted for robotics that shows significant transfer of knowledge from language and vision to low-level robotic actions.
Robotic Transformer 2 (RT-2), a model that leverages powerful vision-language models and robotic data to generate low-level actions, enabling tasks like picking specific objects (e.g., a Coke can near a picture of Taylor Swift) even without prior exposure to such concepts in robot data.
A significant paper demonstrating scaling laws in robotics by training models across multiple hardware platforms, leading to a generalist model that outperformed specialists.
A benchmark dataset that significantly impacted the vision community, used as a comparison to the Open-X dataset in terms of scale and impact.
A YC company partnered with Physical Intelligence to demonstrate robots folding diverse laundry items in a real laundromat, showcasing domestic applications.
Mentioned as an example of a concept that a robot trained with RT-2 could understand and interact with, demonstrating advanced recognition and manipulation capabilities.
Mentioned in the context of identifying the network inputs and outputs that gave rise to scaling laws, akin to the development of large language models.
Author of the poem 'All Watched Over by Machines of Loving Grace', which is referenced in the context of the ideal manifestation of technology automating abundance.
A YC company focused on logistics, partnered with Physical Intelligence to develop a robot capable of picking items and placing them into pouches for shipping, demonstrating industrial applications.
Mentioned as the source of soft pouches used in the logistics task demonstrated by Ultra, highlighting the need for robots in e-commerce fulfillment.
Mentioned as an example of robots that used to require a server on board, highlighting the shift towards cloud-based compute for modern robotics.
The company where several co-founders of Physical Intelligence previously worked in the robotics team, which provided a foundational environment for their collaboration and advancements.
Mentioned as the previous employer of Adnan, the hardware lead at Physical Intelligence.