Why are traditional data-fitting approaches in robotics limited?

Traditional methods require a large amount of data (hundreds of hours) for even simple tasks, suffer from low data efficiency, and have limited multitask performance, lacking the generality seen in human learning.

How do neuro-symbolic concepts help in robotics?

Neuro-symbolic concepts combine neural representations with symbolic reasoning, allowing systems to learn from few demonstrations, generalize to new situations, and plan using compositional abstractions of states and actions.

What is constraint optimization in robotics action generation?

Constraint optimization formulates action generation as finding trajectories that satisfy various constraints, such as joint limits, collision avoidance, and subgoals, enabling the composition of multiple actions.

How does the proposed approach handle generalization in manipulation tasks like hanging objects?

The approach combines visual correspondence guidance with constrained optimization and physical models. It uses neural representations (such as DINOv2 features) for guidance and model-based planning for stability, achieving a success rate of over 90% in one-shot generalization.

How are diffusion models used for spatial reasoning in table setting?

Compositional diffusion models, trained on specific spatial relationships (e.g., 'left of', 'right of'), generate object poses by composing individual diffusion models. Large language models provide common sense knowledge for constraint generation.

What is the benefit of using a neuro-symbolic approach for long horizon planning?

It enables learning action models from few demonstrations and using planning to combine them. This involves trajectory samplers and transition models to predict future states before execution, leading to generalization in complex scenarios.

Why is the neuro-symbolic framework still relevant in the age of large foundation models?

It provides scientific insights into task understanding and learning, offers a principled way to engineer complex robotic systems by integrating various components, and facilitates the composition of diverse models for enhanced capabilities.

Key Moments

Stanford Robotics Seminar ENGR319 | Spring 2026 | Integrated Learning and Planning

Stanford Online

Education | 5 min read | 58 min video

May 20, 2026|1,104 views|33|3

Stanford Stanford Online Robotics


TL;DR

Robots can now learn complex tasks from just one demonstration by combining neural networks with symbolic planning, enabling generalization to new objects and goals, but explaining AI reasoning remains a challenge.

Key Insights

Current data-driven AI approaches for robotics require hundreds of hours of training data for simple tasks, starkly contrasting with human learning capabilities.

Neuro-symbolic approaches can achieve a success rate of over 90% in one-shot learning for object manipulation tasks, significantly outperforming other methods that require more data.

A compositional diffusion model framework allows for the generation of object poses for tasks like table setting by combining specialized diffusion models for each spatial relationship (e.g., 'left of', 'right of').

Long-horizon manipulation tasks can be addressed by composing learned neural trajectory generation models with world models for planning, enabling robots to learn from as few as 10 demonstrations.

Integrating large language models (LLMs) can provide common sense knowledge for spatial reasoning, significantly reducing the need for extensive robot-specific data.

The 'retriever' framework enables closed-loop, asynchronous robot execution with explicit time-typing for modules, improving efficiency and smooth operation for complex tasks.

Bridging the data efficiency gap in robotics

Traditional machine learning approaches for robotics, which focus on fitting functions to data, suffer from severe data inefficiency. For example, training a robot to fold boxes requires hundreds of hours of data, and even simple joystick command generation for video games demands around 200 demonstrations. This contrasts sharply with human capabilities, where learning from a single example and generalizing to new situations is common. The speaker, Jayen, argues that current methods lack the generality and data efficiency needed to build truly adaptable physical intelligence. The goal is to develop systems that can learn from one to ten demonstrations and generalize reliably to novel states, objects, and goals, moving beyond brittle, task-specific policies.

Neuro-symbolic concepts as a compositional abstraction

The proposed solution involves a paradigm shift towards combining machine learning with planning, framed as a combination of world modeling and planning. Instead of just processing pixels, intelligence can be seen as understanding the world through abstract representations of states and actions. This involves a world model that encompasses object properties, possible actions, and their predicted outcomes. Planning then utilizes these models to find a sequence of actions to achieve a goal. The core idea is to plan using 'neuro-symbolic concepts,' which represent compositional abstractions over states and actions. This approach aims to achieve the generality and data efficiency observed in human intelligence.
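As a rough illustration of planning over abstract states and actions, the sketch below runs a breadth-first search over sets of symbolic facts. It is not the speaker's system: the `Action` type, the `plan` function, and the toy mug-hanging domain are all hypothetical, and in a neuro-symbolic system the preconditions and effects would be grounded in learned perception and action models rather than written by hand.

```python
from collections import deque
from typing import NamedTuple


class Action(NamedTuple):
    name: str
    preconditions: frozenset  # facts that must hold before the action
    add_effects: frozenset    # facts the action makes true
    del_effects: frozenset    # facts the action makes false


def plan(initial, goal, actions):
    """Breadth-first search over abstract states.

    Returns a list of action names reaching the goal, or None if no plan exists.
    """
    start = frozenset(initial)
    goal = frozenset(goal)
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:  # every goal fact holds in this state
            return steps
        for action in actions:
            if action.preconditions <= state:
                # Predicted outcome of the action under the abstract world model.
                nxt = (state - action.del_effects) | action.add_effects
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append((nxt, steps + [action.name]))
    return None


# Hypothetical toy domain: hang a mug on a rack.
ACTIONS = [
    Action("pick(mug)",
           frozenset({"hand_empty", "on_table(mug)"}),
           frozenset({"holding(mug)"}),
           frozenset({"hand_empty", "on_table(mug)"})),
    Action("hang(mug, rack)",
           frozenset({"holding(mug)"}),
           frozenset({"hanging(mug)", "hand_empty"}),
           frozenset({"holding(mug)"})),
]

result = plan({"hand_empty", "on_table(mug)"}, {"hanging(mug)"}, ACTIONS)
print(result)  # ['pick(mug)', 'hang(mug, rack)']
```

The point of the sketch is the separation of concerns: the world model says what each action requires and changes, and a generic search procedure composes actions into a plan for any goal expressible in the same vocabulary.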

Learning actionable constraints for generalized manipulation

Action modeling can be reframed not as direct policy learning, but as learning to satisfy a set of constraints. For instance, picking up an object is not just a sequence of movements; the motion must satisfy dynamics, collision avoidance, and task-specific goals (e.g., grasping the object appropriately). This constraint-based formulation allows for compositionality, as multiple tasks can be chained by adding temporal and spatial constraints. Learning occurs at two levels: first, identifying the necessary constraints for a task, and second, generating trajectories that satisfy these constraints. This framework makes it possible to reuse existing human-engineered models (like physics simulators or motion planners) for rigid body dynamics and geometric constraints, while learning task-specific constraints from data. An example is one-shot learning of a hanging task: by identifying contact points through visual correspondence (guided by models like DINOv2) and verifying them with model-based planning, the system can generalize to novel objects like mugs or even custom-printed shapes with a success rate of over 90%, avoiding the need for extensive object-specific training.
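To make the constraint-based view concrete, here is a minimal sketch of trajectory generation as optimization, assuming a 2D point robot, one circular obstacle, and a soft penalty in place of a hard collision constraint. The function name `optimize_trajectory`, the weights, and the scene are illustrative assumptions, not details from the talk; a real system would use a proper constrained solver and learned task constraints.

```python
import numpy as np


def optimize_trajectory(start, goal, obstacle_center, obstacle_radius,
                        n_waypoints=20, n_iters=500, step=0.05, w_obs=5.0):
    """Gradient descent on waypoints under three constraints.

    - start/goal (subgoal) constraints: endpoints are held fixed
    - smoothness cost: sum of squared distances between consecutive waypoints
    - collision penalty: w_obs * max(0, radius - distance_to_obstacle)^2
    """
    start = np.asarray(start, dtype=float)
    goal = np.asarray(goal, dtype=float)
    center = np.asarray(obstacle_center, dtype=float)

    # Initialize with a straight line from start to goal.
    alphas = np.linspace(0.0, 1.0, n_waypoints)[:, None]
    traj = (1.0 - alphas) * start + alphas * goal

    for _ in range(n_iters):
        grad = np.zeros_like(traj)

        # Gradient of the smoothness cost.
        diffs = traj[1:] - traj[:-1]
        grad[:-1] -= 2.0 * diffs
        grad[1:] += 2.0 * diffs

        # Gradient of the collision penalty (nonzero only inside the obstacle).
        offsets = traj - center
        dists = np.linalg.norm(offsets, axis=1, keepdims=True)
        violation = np.maximum(0.0, obstacle_radius - dists)
        grad += w_obs * (-2.0 * violation) * offsets / np.maximum(dists, 1e-9)

        # Endpoints are constrained exactly, so they never move.
        grad[0] = 0.0
        grad[-1] = 0.0
        traj -= step * grad
    return traj


# Hypothetical scene: the straight-line path would cut through the obstacle.
center, radius = np.array([0.5, -0.05]), 0.2
traj = optimize_trajectory([0.0, 0.0], [1.0, 0.0], center, radius)
clearance = np.linalg.norm(traj - center, axis=1).min()
print(f"closest waypoint is {clearance:.3f} from the obstacle center")
```

Because each constraint contributes its own cost term, composing actions amounts to adding terms: a second subgoal or a second obstacle is one more gradient added inside the loop. Note that a soft penalty only approximately enforces the constraint, so waypoints may end slightly inside the obstacle radius.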

Spatial reasoning powers table setting

Extending this to more complex, long-horizon tasks like setting a table, the framework integrates diffusion models for spatial reasoning. The problem is framed as finding object poses that satisfy symbolic relational constraints ('apple left of plate,' 'spoon right of plate'). These relationships can be learned or inferred using vision-language models and common sense knowledge. A compositional diffusion model architecture is employed, where specialized diffusion models are trained for each type of spatial relationship. These models learn to predict gradients of an energy function, quantifying how well an object's pose satisfies a given constraint. By composing these gradients at inference time, the system can generate plausible object placements for various table settings, even integrating robot motion constraints for real-world execution.
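The composition step can be sketched as follows, under strong simplifying assumptions: each relation is represented by a hand-written energy gradient instead of a trained diffusion model, poses are 2D positions only, and sampling is a simple annealed noisy gradient descent. All names (`grad_left_of`, `grad_right_of`, `sample_poses`, `MARGIN`) are hypothetical.

```python
import numpy as np

MARGIN = 0.15  # hypothetical minimum horizontal gap between related objects


def grad_left_of(poses, a, b):
    """Gradient of E = max(0, x_a - x_b + MARGIN)^2 with respect to all poses.

    The energy is zero when object a is at least MARGIN to the left of b.
    """
    grad = np.zeros_like(poses)
    violation = max(0.0, poses[a, 0] - poses[b, 0] + MARGIN)
    grad[a, 0] = 2.0 * violation
    grad[b, 0] = -2.0 * violation
    return grad


def grad_right_of(poses, a, b):
    """'a right of b' is the same energy as 'b left of a'."""
    return grad_left_of(poses, b, a)


def sample_poses(constraints, n_objects, n_steps=200, step=0.1, seed=0):
    """Annealed noisy gradient descent on the sum of per-relation energies."""
    rng = np.random.default_rng(seed)
    poses = rng.uniform(-0.5, 0.5, size=(n_objects, 2))
    for t in range(n_steps):
        noise_scale = 0.05 * (1.0 - t / n_steps)  # noise shrinks toward zero
        # Composition: add up the gradients from every relation model.
        grad = sum(fn(poses, a, b) for fn, a, b in constraints)
        poses = poses - step * grad + noise_scale * rng.normal(size=poses.shape)
    return poses


APPLE, PLATE, SPOON = 0, 1, 2
constraints = [
    (grad_left_of, APPLE, PLATE),   # 'apple left of plate'
    (grad_right_of, SPOON, PLATE),  # 'spoon right of plate'
]
poses = sample_poses(constraints, n_objects=3)
print(poses)
```

Summing gradients is what makes the approach compositional: a new table setting is just a different list of constraints over the same relation models, with no retraining. In the framework described above, the hand-written gradients would be replaced by the outputs of the trained per-relation diffusion models.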

Long-horizon planning with learned action models

For long-horizon planning, the system learns action models that combine trajectory generation and state transition prediction. Using as few as 10 demonstrations for tasks like washing dishes, the system segments trajectories into actions, learning both how to generate plausible trajectories (using diffusion models) and predict future states (using transition models). This allows for internal simulation: the system can predict the outcome of executing a potential trajectory, check for collisions, and replan if necessary. This approach demonstrates generalization to new scenarios, such as sorting multiple plates onto a rack with varying heights, orientations, and novel obstacles, outperforming end-to-end trained policies. The core principle is composing neural trajectory generation with world models for planning.
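The sample, predict, check, and replan loop described above can be sketched as below. This is an assumption-laden toy: `sample_trajectory` is a random stand-in for a learned diffusion-based trajectory sampler, `predict_states` is a trivial stand-in for a learned transition model, and the scene is a 2D point moving past one circular obstacle.

```python
import numpy as np


def sample_trajectory(rng, start, goal, n_waypoints=10, noise=0.3):
    """Stand-in for a learned sampler: a straight line plus a random arched detour."""
    alphas = np.linspace(0.0, 1.0, n_waypoints)[:, None]
    line = (1.0 - alphas) * start + alphas * goal
    bump = np.sin(np.pi * alphas) * rng.normal(scale=noise, size=2)
    return line + bump


def predict_states(trajectory):
    """Stand-in for a learned transition model.

    Here the held object is assumed to follow the gripper waypoints exactly.
    """
    return trajectory


def in_collision(states, center, radius):
    """True if any predicted state lies inside the circular obstacle."""
    return bool(np.any(np.linalg.norm(states - center, axis=1) < radius))


def plan_with_rollouts(start, goal, center, radius, max_samples=100, seed=0):
    """Sample candidates, simulate each internally, and keep the first safe one."""
    rng = np.random.default_rng(seed)
    start = np.asarray(start, dtype=float)
    goal = np.asarray(goal, dtype=float)
    center = np.asarray(center, dtype=float)
    for attempt in range(1, max_samples + 1):
        candidate = sample_trajectory(rng, start, goal)
        predicted = predict_states(candidate)      # predict before executing
        if not in_collision(predicted, center, radius):
            return candidate, attempt
        # Otherwise discard the candidate and sample again (replan).
    return None, max_samples


obstacle_center, obstacle_radius = np.array([0.5, 0.0]), 0.1
traj, attempts = plan_with_rollouts([0.0, 0.0], [1.0, 0.0],
                                    obstacle_center, obstacle_radius)
print(f"found a collision-free trajectory after {attempts} sample(s)")
```

The design choice worth noting is that rejection happens in imagination: candidates are filtered using predicted states, so a new obstacle changes which samples survive without changing the sampler itself.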

The enduring relevance of neuro-symbolism in the era of foundation models

Despite advances in large foundation models, neuro-symbolic approaches remain crucial. They offer scientific insights into task and model complexity, helping analyze the expressiveness of neural networks and understand data requirements for learnability (identifiability). From a systems engineering perspective, robotics necessitates compositional systems, and neuro-symbolism provides a principled way to integrate perception, memory, planning, and control. The open-source 'retriever' framework exemplifies this, enabling asynchronous, time-explicit robot execution for complex tasks with memory tracking. Moreover, neuro-symbolism facilitates the principled composition of diverse foundation models (vision, language, action) using algorithms for probabilistic reasoning and planning, moving beyond simple 'chain of thought' reasoning to create robust 'model orchestrators' capable of complex task execution.

Continual learning and self-improvement for robotics

The ultimate vision is for robots to possess continual learning capabilities and become self-improving systems. Rather than relying solely on human-provided data, the goal is to start with basic compositional foundation models and use reasoning, planning, and exploration algorithms to acquire new capabilities and generate new experiences. This new data can then train the next generation of foundation models, creating a virtuous cycle of improvement. This approach mirrors recent successes in distilling knowledge between different foundation models (e.g., combining segmentation and vision-language models for promptable segmentation). Neuro-symbolic concepts provide a framework for bridging modularity, complexity, and intelligence, leading to more data-efficient learning and better generalization for building truly intelligent systems.


Common Questions

The main goal is to create systems that can perceive, understand language, and take actions in the physical world, mimicking human-level performance in learning and generalization from few demonstrations.

Topics

AI & Machine Learning, Technology & Innovation, Science & Mathematics, Diffusion Models, Foundational Models, Spatial Reasoning, Robotic Manipulation, Learning And Planning, Neuro-symbolic AI, Constraint Optimization, Long Horizon Planning

Mentioned in this video

Organizations

UPenn

The speaker will soon join UPenn as an assistant professor.

Companies

Amazon

The speaker is currently a member of technical staff at Amazon.

Software & Apps

DINOv2

Mentioned as a visual feature model used for computing visual feature correspondence to guide contact point detection.

Gemini

Mentioned as an example of the video-language models previously used to segment trajectories and assign action names.

Retriever

A framework for closed-loop robot agents that supports asynchronous robot action and time-explicit typing.

Segment Anything Model

A model that demonstrates distilling knowledge between vision-language models and segmentation models to enable promptable segmentation based on text instructions.
