Key Moments
Stanford Robotics Seminar ENGR319 | Spring 2026 | Integrated Learning and Planning
By combining neural networks with symbolic planning, robots can now learn complex tasks from just one demonstration and generalize to new objects and goals. Explaining AI reasoning remains a challenge.
Key Insights
Current data-driven AI approaches for robotics require hundreds of hours of training data for simple tasks, starkly contrasting with human learning capabilities.
Neuro-symbolic approaches can achieve over 90% success rate in one-shot learning for object manipulation tasks, significantly outperforming other methods that require more data.
A compositional diffusion model framework allows for the generation of object poses for tasks like table setting by combining specialized diffusion models for each spatial relationship (e.g., 'left of', 'right of').
Long-horizon manipulation tasks can be addressed by composing learned neural trajectory generation models with world models for planning, enabling robots to learn from as few as 10 demonstrations.
Integrating large language models (LLMs) can provide common sense knowledge for spatial reasoning, significantly reducing the need for extensive robot-specific data.
The 'retriever' framework enables closed-loop, asynchronous robot execution with time-explicit typing for modules, improving efficiency and smoothness on complex tasks.
Bridging the data efficiency gap in robotics
Traditional machine learning approaches to robotics, which focus on fitting functions to data, are severely data-inefficient. For example, training a robot to fold boxes requires hundreds of hours of data, and even generating simple joystick commands for video games demands around 200 demonstrations. Humans, by contrast, routinely learn from a single example and generalize to new situations. The speaker, Jayen, argues that current methods lack the generality and data efficiency needed to build truly adaptable physical intelligence. The goal is to develop systems that learn from one to ten demonstrations and generalize reliably to novel states, objects, and goals, moving beyond brittle, task-specific policies.
Neuro-symbolic concepts as a compositional abstraction
The proposed solution is a paradigm shift: combine machine learning with planning, where learning supplies a world model and planning reasons over it. Instead of processing pixels alone, an intelligent system understands the world through abstract representations of states and actions. The world model captures object properties, the available actions, and their predicted outcomes. Planning then uses this model to find a sequence of actions that achieves a goal. The core idea is to plan with 'neuro-symbolic concepts,' which are compositional abstractions over states and actions. This approach aims for the generality and data efficiency observed in human intelligence.
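The split between a world model and a planner can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the symbolic facts, the two hand-written actions, and the breadth-first search are stand-ins, not the representations or the planner used in the talk, where such models are learned rather than written by hand.

```python
from collections import deque
from typing import FrozenSet, List, NamedTuple, Optional

State = FrozenSet[str]

class Action(NamedTuple):
    name: str
    preconditions: FrozenSet[str]  # facts that must hold before acting
    add_effects: FrozenSet[str]    # facts the action makes true
    del_effects: FrozenSet[str]    # facts the action makes false

def apply(state: State, action: Action) -> Optional[State]:
    """World model: predict the next state, or None if the action does not apply."""
    if not action.preconditions <= state:
        return None
    return (state - action.del_effects) | action.add_effects

def plan(start: State, goal: FrozenSet[str], actions: List[Action]) -> Optional[List[str]]:
    """Breadth-first search over predicted states until every goal fact holds."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:
            return steps
        for action in actions:
            nxt = apply(state, action)
            if nxt is not None and nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, steps + [action.name]))
    return None

# Hypothetical toy domain: hang a mug on a rack.
ACTIONS = [
    Action("pick(mug)",
           frozenset({"hand_empty", "on_table(mug)"}),
           frozenset({"holding(mug)"}),
           frozenset({"hand_empty", "on_table(mug)"})),
    Action("hang(mug, rack)",
           frozenset({"holding(mug)"}),
           frozenset({"on_rack(mug)", "hand_empty"}),
           frozenset({"holding(mug)"})),
]
START = frozenset({"hand_empty", "on_table(mug)"})
GOAL = frozenset({"on_rack(mug)"})

print(plan(START, GOAL, ACTIONS))  # ['pick(mug)', 'hang(mug, rack)']
```

Because the planner only sees facts and effects, the same search works unchanged when actions or objects are added, which is the kind of compositionality the abstraction is meant to provide.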
Learning actionable constraints for generalized manipulation
Action modeling can be reframed as learning to satisfy a set of constraints rather than as direct policy learning. For instance, picking up an object is not just a sequence of movements: the motion must satisfy dynamics, avoid collisions, and meet task-specific goals (e.g., grasping the object appropriately). This constraint-based formulation is compositional, because multiple tasks can be chained by adding temporal and spatial constraints. Learning occurs at two levels: first, identifying the constraints a task requires, and second, generating trajectories that satisfy them. The framework can reuse existing human-engineered models (like physics simulators or motion planners) for rigid-body dynamics and geometric constraints while learning task-specific constraints from data. An example is one-shot learning of a hanging task: by identifying contact points through visual correspondence (guided by models like DINOv2) and verifying them with model-based planning, the system generalizes to novel objects like mugs or even custom-printed shapes with over 90% success, avoiding the need for extensive object-specific training.
Spatial reasoning powers table setting
To extend this to more complex, long-horizon tasks like setting a table, the framework integrates diffusion models for spatial reasoning. The problem is framed as finding object poses that satisfy symbolic relational constraints (e.g., 'apple left of plate', 'spoon right of plate'). These relationships can be learned or inferred using vision-language models and common sense knowledge. A compositional diffusion model architecture is employed, in which a specialized diffusion model is trained for each type of spatial relationship. Each model learns to predict the gradient of an energy function that quantifies how well an object's pose satisfies a given constraint. By composing these gradients at inference time, the system can generate plausible object placements for various table settings, and can even integrate robot motion constraints for real-world execution.
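The idea of composing per-relation gradients at inference time can be sketched with hand-written energies. This is an illustration under loud assumptions: the quadratic hinge energy, the analytic gradients, and the noise-free gradient descent are stand-ins chosen for clarity, whereas the approach described above uses trained diffusion models and a diffusion sampling procedure. The function names are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Pose = Tuple[float, float]
Poses = Dict[str, Pose]
Grad = Dict[str, Tuple[float, float]]
GradFn = Callable[[Poses], Grad]

def left_of(a: str, b: str, margin: float = 0.2) -> GradFn:
    """Gradient of E = max(0, x_a - x_b + margin)^2 + (y_a - y_b)^2."""
    def grad(poses: Poses) -> Grad:
        (xa, ya), (xb, yb) = poses[a], poses[b]
        gap = max(0.0, xa - xb + margin)  # positive while a is not left of b
        dy = ya - yb                      # pulls the two objects onto one row
        return {a: (2 * gap, 2 * dy), b: (-2 * gap, -2 * dy)}
    return grad

def right_of(a: str, b: str, margin: float = 0.2) -> GradFn:
    """'a right of b' is the same energy as 'b left of a'."""
    return left_of(b, a, margin)

def compose_and_descend(poses: Poses, grads: List[GradFn],
                        fixed: FrozenSet[str] = frozenset(),
                        step: float = 0.1, iters: int = 200) -> Poses:
    """Sum the gradients of all relations and follow them downhill."""
    poses = dict(poses)
    for _ in range(iters):
        total = {name: [0.0, 0.0] for name in poses}
        for grad_fn in grads:
            for name, (gx, gy) in grad_fn(poses).items():
                total[name][0] += gx
                total[name][1] += gy
        for name, (gx, gy) in total.items():
            if name not in fixed:  # e.g. keep the plate where it is
                x, y = poses[name]
                poses[name] = (x - step * gx, y - step * gy)
    return poses

start = {"plate": (0.0, 0.0), "apple": (0.5, 0.3), "spoon": (-0.4, -0.2)}
relations = [left_of("apple", "plate"), right_of("spoon", "plate")]
result = compose_and_descend(start, relations, fixed=frozenset({"plate"}))
print({name: (round(x, 3), round(y, 3)) for name, (x, y) in result.items()})
```

Adding a new relation to the scene means appending one more gradient function to the list; no joint model over all relations has to be retrained, which is the point of composing at inference time.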
Long-horizon planning with learned action models
For long-horizon planning, the system learns action models that combine trajectory generation and state transition prediction. Using as few as 10 demonstrations for tasks like washing dishes, the system segments trajectories into actions, learning both how to generate plausible trajectories (using diffusion models) and predict future states (using transition models). This allows for internal simulation: the system can predict the outcome of executing a potential trajectory, check for collisions, and replan if necessary. This approach demonstrates generalization to new scenarios, such as sorting multiple plates onto a rack with varying heights, orientations, and novel obstacles, outperforming end-to-end trained policies. The core principle is composing neural trajectory generation with world models for planning.
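The loop of generating a trajectory, simulating it internally, checking for collisions, and replanning can be sketched as follows. All components are toy assumptions: sample_trajectory stands in for a learned diffusion-based generator, predict_states stands in for a learned transition model, and the circular obstacle is a placeholder for real collision checking.

```python
import math
import random
from typing import List, Optional, Tuple

Point = Tuple[float, float]

def sample_trajectory(start: Point, goal: Point, rng: random.Random,
                      steps: int = 10) -> List[Point]:
    """Stand-in for a learned generator: a straight path with a random bend in y."""
    bend = rng.uniform(-1.0, 1.0)
    traj = []
    for i in range(steps + 1):
        t = i / steps
        x = start[0] + t * (goal[0] - start[0])
        y = start[1] + t * (goal[1] - start[1]) + bend * math.sin(math.pi * t)
        traj.append((x, y))
    return traj

def predict_states(traj: List[Point]) -> List[Point]:
    """Stand-in for a learned transition model: the held object follows the gripper."""
    return list(traj)

def collides(states: List[Point], obstacle: Point, radius: float) -> bool:
    """True if any predicted state falls inside the circular obstacle."""
    return any(math.dist(s, obstacle) < radius for s in states)

def plan_with_replanning(start: Point, goal: Point, obstacle: Point, radius: float,
                         rng: random.Random,
                         max_attempts: int = 50) -> Optional[Tuple[List[Point], int]]:
    """Sample, simulate internally, and resample until a rollout is collision-free."""
    for attempt in range(1, max_attempts + 1):
        traj = sample_trajectory(start, goal, rng)
        predicted = predict_states(traj)
        if not collides(predicted, obstacle, radius):
            return traj, attempt
    return None  # no feasible trajectory found within the attempt budget

outcome = plan_with_replanning((0.0, 0.0), (2.0, 0.0), (1.0, 0.0), 0.3,
                               random.Random(0))
if outcome is not None:
    trajectory, attempts = outcome
    print(f"found a collision-free trajectory after {attempts} attempt(s)")
```

Rejection sampling is the simplest possible replanning strategy and is used here only to keep the sketch short; the source does not say which search procedure the actual system uses.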
The enduring relevance of neuro-symbolism in the era of foundation models
Despite advances in large foundation models, neuro-symbolic approaches remain crucial. They offer scientific insights into task and model complexity, helping analyze the expressiveness of neural networks and understand data requirements for learnability (identifiability). From a systems engineering perspective, robotics necessitates compositional systems, and neuro-symbolism provides a principled way to integrate perception, memory, planning, and control. The open-source 'retriever' framework exemplifies this, enabling asynchronous, time-explicit robot execution for complex tasks with memory tracking. Moreover, neuro-symbolism facilitates the principled composition of diverse foundation models (vision, language, action) using algorithms for probabilistic reasoning and planning, moving beyond simple 'chain of thought' reasoning to create robust 'model orchestrators' capable of complex task execution.
Continual learning and self-improvement for robotics
The ultimate vision is for robots to possess continual learning capabilities and become self-improving systems. Rather than relying solely on human-provided data, the goal is to start with basic compositional foundation models and use reasoning, planning, and exploration algorithms to acquire new capabilities and generate new experiences. This new data can then train the next generation of foundation models, creating a virtuous cycle of improvement. This approach mirrors recent successes in distilling knowledge between different foundation models (e.g., combining segmentation and vision-language models for promptable segmentation). Neuro-symbolic concepts provide a framework for bridging modularity, complexity, and intelligence, leading to more data-efficient learning and better generalization for building truly intelligent systems.
Common Questions
The main goal is to create systems that can perceive, understand language, and take actions in the physical world, mimicking human-level performance in learning and generalization from few demonstrations.
Mentioned in this video
DINOv2 is mentioned as a visual feature model used to compute visual feature correspondences that guide contact point detection.
Gemini is mentioned as an example of the video-language models previously used to segment trajectories and assign action names.
The 'retriever' framework is a framework for closed-loop robot agents that supports asynchronous robot action and time-explicit typing.
A model that demonstrates distilling knowledge between vision-language models and segmentation models to enable promptable segmentation based on text instructions.