Key Moments
Stanford Robotics Seminar ENGR319 | Spring 2026 | Leveraging Geometry in Robot Learning
Researchers are developing robot learning models that incorporate geometric priors, aiming to train with roughly 10x less data and generalize better than current large-scale models, potentially shifting the scaling curve to the left.
Key Insights
Traditional model-based robotics, relying on hand-coded geometric models, often failed because its assumptions did not reflect reality, leading to failures in tasks such as picking up objects.
Generalist models like VLAs learn from data but require huge datasets and lose explicit geometric understanding during self-attention, so the geometry has to be relearned from data.
Equivariant diffusion policy, which encodes point clouds and reasons with finite subgroups of SE(3), demonstrated a 10x improvement in data efficiency over baseline diffusion policy on the MimicGen benchmark, performing better with 100 demonstrations than the baseline did with 1000.
The image-to-sphere method, which embeds images on a sphere and uses Fourier transforms with spherical harmonics, achieved a twofold improvement in data efficiency on MimicGen benchmarks compared to baselines, especially benefiting tasks requiring precision.
Raven, a method using geometric transformer attention by embedding images as 3D rays, provides an intellectually consistent mechanism for combining multiple views and modalities, though questions remain about explicit equivariance and camera calibration.
The Pix-to-Act approach, using multi-view transformers and data augmentation via camera rotation, infers image-plane trajectories that are triangulated into 3D. Without any pre-training of its own, it outperforms pre-trained baselines such as LDM on multitask scenarios.
The limitations of past and present approaches in robotics
For the past three decades, robotics predominantly relied on hand-coded geometric models for model-based planning. While powerful, these methods made strong assumptions about the environment that often proved unrealistic, leading to critical failures, such as robots misestimating object locations. In contrast, current generalist models, like large vision-language-action models (VLAs), learn directly from data, overcoming some of these limitations. However, they require massive datasets for training and tend to lose explicit geometric understanding during their reasoning processes, so they need extensive data to relearn this spatial information. This creates a dichotomy: at one extreme are rigid, assumption-laden models, and at the other are data-hungry, geometrically disembodied generalist models.
Introducing geometric structure into machine learning models
The central question explored is whether a middle ground exists—machine learning models that effectively incorporate geometric or physical priors. The research presented proposes methods to encode observations in ways that retain geometric structure, aiming to improve policy learning. This involves developing novel representations of the world, such as point clouds, embeddings on a sphere, 3D rays, and stereo images. These geometric representations are then processed using frameworks designed to integrate geometric reasoning, including finite subgroups of SE(3), Fourier coefficients, and geometric transformer attention. The goal is to create models that are smarter by design, leveraging inherent symmetries and structures of the physical world rather than learning them from scratch through sheer data volume.
Equivariant diffusion policy: 10x data efficiency through symmetry
The first method presented, 'equivariant diffusion policy,' encodes the world as a point cloud and reasons over finite subgroups of SE(3). The approach is inspired by Emmy Noether's theorems, which tie symmetries such as translation and rotation to physical conservation laws, and it builds those symmetries directly into the model. The underlying argument is that when the transition dynamics are invariant to translation and rotation, the optimal policy is expected to be equivariant. The model uses equivariant neural network layers, designed so that a transformation of the input (such as a rotation) automatically produces the corresponding transformation of the output flow field or action. Benchmarked on the MimicGen tasks, this method demonstrated a 10x improvement in data efficiency over a standard diffusion policy, achieving better performance with 100 demonstrations than the baseline with 1000. It excelled particularly on tasks requiring high equivariance and generalized better over pose.
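The equivariance property described above, that rotating the input rotates the output in the same way, can be checked numerically. The sketch below is not the equivariant layers used in the talk; it swaps in plain group averaging over C4 (90-degree rotations about the z axis) to turn an arbitrary policy into an equivariant one. The names `rot_z`, `raw_policy`, and `symmetrized_policy`, and the toy policy itself, are hypothetical.

```python
import numpy as np

def rot_z(k):
    """Rotation by k * 90 degrees about the z axis (an element of C4)."""
    c, s = [(1, 0), (0, 1), (-1, 0), (0, -1)][k % 4]
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def raw_policy(points, W):
    """Hypothetical non-equivariant policy: point cloud (N, 3) -> 3D action."""
    return np.tanh(W @ points.mean(axis=0))

def symmetrized_policy(points, W):
    """Group-average the raw policy over C4 so that it becomes equivariant."""
    actions = []
    for k in range(4):
        R = rot_z(k)
        # Apply g to the input cloud, then g^{-1} (= R.T) to the predicted action.
        actions.append(R.T @ raw_policy(points @ R.T, W))
    return np.mean(actions, axis=0)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))                      # arbitrary illustrative weights
points = rng.normal(size=(50, 3)) + np.array([0.3, -0.2, 0.1])

R = rot_z(1)                                     # rotate the scene by 90 degrees
rotated_then_act = symmetrized_policy(points @ R.T, W)
act_then_rotated = R @ symmetrized_policy(points, W)
print(np.allclose(rotated_then_act, act_then_rotated))
```

Group averaging costs one forward pass per group element, which is one reason dedicated equivariant layers are usually preferred in practice; the check at the end is the same for both.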
Image-to-sphere: Leveraging RGB input with geometric priors
The second approach, 'image-to-sphere,' addresses the challenge of working directly from RGB input by embedding images on the 2D sphere. An RGB image is encoded and its features are projected onto the sphere, where they can be rotated by elements of SO(3) and expanded in spherical harmonics, which serve as the basis for a Fourier transform on the sphere. This allows for convolutions in both spherical and SO(3) spaces. The processed features are then mapped back to a discrete subgroup of SO(3) and used within an equivariant diffusion framework. While it does not match the performance of the point cloud method, it is about twice as data-efficient as the baselines on MimicGen benchmarks. It is particularly useful for tasks requiring high precision, such as manipulating small objects, where point clouds might lack resolution. It also showed improved performance with pre-trained image encoders, reaching 72% success.
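A small numerical illustration of why a spherical Fourier representation is convenient: rotating a signal on the sphere acts linearly on its coefficients. The sketch assumes a signal band-limited to degree 1 and uses the unnormalized real basis [1, x, y, z] rather than properly normalized spherical harmonics. The actual method works with higher degrees, and everything named here (`basis`, `fit_coeffs`, `signal`) is hypothetical.

```python
import numpy as np

def basis(dirs):
    """Unnormalized real spherical harmonics up to degree 1: [1, x, y, z]."""
    return np.hstack([np.ones((len(dirs), 1)), dirs])

def fit_coeffs(dirs, values):
    """Least-squares coefficients of a signal sampled at unit directions."""
    coeffs, *_ = np.linalg.lstsq(basis(dirs), values, rcond=None)
    return coeffs

def signal(dirs):
    """Illustrative degree-1 signal on the sphere."""
    return basis(dirs) @ np.array([0.5, 1.0, -2.0, 0.3])

rng = np.random.default_rng(0)
dirs = rng.normal(size=(200, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # sample points on the sphere

t = np.deg2rad(30.0)                                   # rotation about z by 30 degrees
R = np.array([[np.cos(t), -np.sin(t), 0.0],
              [np.sin(t),  np.cos(t), 0.0],
              [0.0,        0.0,       1.0]])

coeffs = fit_coeffs(dirs, signal(dirs))
# Rotated signal f'(p) = f(R^T p); for row vectors, R^T p is computed as dirs @ R.
coeffs_rot = fit_coeffs(dirs, signal(dirs @ R))

# The degree-0 coefficient is unchanged; the degree-1 block is rotated by R.
print(np.isclose(coeffs_rot[0], coeffs[0]), np.allclose(coeffs_rot[1:], R @ coeffs[1:]))
```

At higher degrees the same pattern holds with larger rotation blocks per degree, which is what makes rotation-equivariant processing in coefficient space tractable.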
Raven: Geometric transformer attention for multi-view reasoning
The 'Raven' method embeds images as 3D rays, where each ray from the camera to an image patch is oriented with respect to the patch's texture. This allows for a more intuitive embedding of images into the 3D world. To reason about these rays, Raven employs 'geometric transformer attention.' This technique transforms queries, keys, and values into a common reference frame before the attention operation, and then back, essentially incorporating reference frames into the attention mechanism without fundamentally altering the transformer's learning process. While this method shows slightly lower performance gains than previous approaches, its strength lies in its conceptually consistent mechanism for combining multiple views and modalities. It provides a unified way to handle different data types by attaching them to a common reference frame, though it requires camera calibration and careful design of coordinate frames.
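The idea of moving queries, keys, and values into a common reference frame before attention, and moving the result back afterwards, can be sketched in a few lines. This is a simplified illustration rather than Raven's implementation: it assumes each token carries a rotation-only local frame and 3-vector features, and the names `random_rotation` and `frame_attention` are hypothetical.

```python
import numpy as np

def random_rotation(rng):
    """Random 3x3 rotation from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))        # make the factorization unique
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]             # force a proper rotation (det = +1)
    return q

def frame_attention(frames, q, k, v):
    """frames: (N, 3, 3) local-to-world rotations; q, k, v: (N, 3) local vectors."""
    # 1. Express every token's query, key, and value in the shared world frame.
    q_w = np.einsum("nij,nj->ni", frames, q)
    k_w = np.einsum("nij,nj->ni", frames, k)
    v_w = np.einsum("nij,nj->ni", frames, v)
    # 2. Ordinary scaled dot-product attention in that frame.
    scores = q_w @ k_w.T / np.sqrt(q.shape[1])
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    out_w = weights @ v_w
    # 3. Map each output back into its own token's local frame (R^T).
    return np.einsum("nji,nj->ni", frames, out_w)

rng = np.random.default_rng(0)
frames = np.stack([random_rotation(rng) for _ in range(5)])
q, k, v = (rng.normal(size=(5, 3)) for _ in range(3))

G = random_rotation(rng)               # arbitrary change of world frame
out = frame_attention(frames, q, k, v)
out_moved = frame_attention(np.einsum("ij,njk->nik", G, frames), q, k, v)
print(np.allclose(out, out_moved))
```

Because the attention itself is unchanged, the choice of world frame drops out of the result, which is the sense in which the frames are incorporated without altering how the transformer learns. The sketch also shows where calibration enters: the `frames` must be known for every token.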
Pix-to-Act: Multi-view transformers and data augmentation for precise control
The 'Pix-to-Act' approach utilizes two cameras mounted on a robot's end-effector, inferring trajectories for key points within each camera's image plane. These planar trajectories are then triangulated into 3D space. The model employs a multi-view transformer that performs self-attention within each image and cross-attention between images, allowing it to integrate context from different views. A key innovation here is a novel data augmentation technique—randomly rotating the cameras around their visual axes—which forces the model to focus on local image structure for inferring trajectories. This significantly improves generalization over viewpoint, enabling the model to outperform pre-trained large baseline models (like LDM) on multitask scenarios without any pre-training, and achieve high success rates in complex tasks like coffee making.
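The triangulation step, which lifts matching image-plane points from two calibrated cameras into 3D, can be illustrated with standard linear (DLT) triangulation. The talk summary does not say which triangulation method is used, so this is an assumed stand-in. The intrinsics `K`, the 10 cm baseline, and the waypoints are made-up values.

```python
import numpy as np

def project(P, X):
    """Project a 3D point X with a 3x4 camera matrix P to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def triangulate(P1, P2, uv1, uv2):
    """Linear (DLT) triangulation of one point seen in two calibrated views."""
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    # The homogeneous solution is the right singular vector with the
    # smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

# Hypothetical stereo pair: shared intrinsics, second camera offset 10 cm along x.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])

# An illustrative 3D key-point trajectory, observed only as pixels in each view.
waypoints = np.array([[0.05, -0.02, 0.40], [0.04, 0.00, 0.35], [0.02, 0.01, 0.30]])
pixels1 = [project(P1, X) for X in waypoints]
pixels2 = [project(P2, X) for X in waypoints]

recovered = np.array([triangulate(P1, P2, a, b) for a, b in zip(pixels1, pixels2)])
print(np.allclose(recovered, waypoints, atol=1e-6))
```

With noise-free pixels the recovery is exact up to floating-point error. With predicted trajectories the two image-plane estimates will disagree slightly, and the SVD then returns a least-squares compromise.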
The impact of geometry on scaling laws and future directions
The overarching hope is that by incorporating geometric priors and symmetries, these models can shift the scaling curves to the left, meaning they become more data-efficient. Instead of solely relying on massive datasets, smarter models can leverage structural knowledge. Early evidence from workshop papers and experiments suggests that equivariant models can indeed shift these power-law scaling relationships, making data count for more. The research acknowledges limitations, such as focusing primarily on translation and rotation as symmetries, and the need for careful integration of coordinate frames. Future work may involve exploring more complex symmetries, applying these techniques across diverse data modalities like tactile sensing, and further validating the shift in scaling laws. The ultimate goal is to move towards robots that learn more efficiently by incorporating fundamental knowledge of the physical world.
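One way to read "shifting the scaling curve to the left" is sketched below, under the assumption that task error follows a power law in the number of demonstrations N. The symbols a, b, and k are illustrative and do not come from the talk.

```latex
\[
  E_{\text{base}}(N) = a\,N^{-b},
  \qquad
  E_{\text{geo}}(N) = E_{\text{base}}(kN) = a\,k^{-b}\,N^{-b},
  \quad k > 1 .
\]
\[
  \log E_{\text{geo}}(N) = \log a - b \log k - b \log N .
\]
```

On a log-log plot the two curves have the same slope, and the geometric model's curve sits a distance of log k to the left. Under this reading, matching a 1000-demonstration baseline with 100 demonstrations corresponds to k of about 10.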
Equivariant Diffusion Policy: Data Efficiency Comparison
Data extracted from this episode
| Model | Demonstrations | Performance |
|---|---|---|
| Equivariant Diffusion Policy (Point Cloud) | 100 | Better than baseline Diffusion Policy with 1000 demonstrations |
| Baseline Diffusion Policy | 1000 | Inferior to Equivariant Diffusion Policy with 100 demonstrations |
Image-to-Sphere: Data Efficiency Comparison
| Model | Demonstrations | Performance |
|---|---|---|
| Image-to-Sphere Method | 100 | Outperforms baseline by 2x in data efficiency |
| Baseline Models | 200 | Inferior to Image-to-Sphere with 100 demos |
Tasks Solved by Equivariant Models within 100 Demonstrations
| Model | Task | Demonstrations Required |
|---|---|---|
| Equivariant Diffusion Policy | Physical Tasks (various) | <100 (except coffee making at 160) |
| Image-to-Sphere | Physical Tasks (various) | ~60-70 |
Common Questions
Traditional robotics models rely on hand-coded geometric models that make strong assumptions about the environment, leading to failures when these assumptions don't match reality. Generalist models, such as VLAs, learn directly from data, overcoming some of these limitations but requiring vast amounts of training data.
Mentioned in this video
A classic example of a model-based robotics paper from 2022 that won an RSS best paper award, demonstrating 'You Only Demonstrate Once' learning by estimating object locations from CAD models.
A benchmark dataset for robotic manipulation tasks, comprising 12 different tasks used to evaluate various policy learning methods.
A large vision-language-action model mentioned as an example of current generalist models in robotics.
A visual encoder used in generalist VLA models, which converts visual input into embeddings.
A type of neural network architecture, often used as a pre-trained encoder for image processing in robotics models.
A baseline model for learning robot control policies, serving as a comparison point for the proposed equivariant methods.
A robotics policy learning model mentioned as a baseline in performance comparisons.
A recent method developed by the speaker's lab that uses multi-view transformers and a novel data augmentation technique for inferring 3D trajectories from 2D image planes.
A language-model-based reasoning model, likely referring to a large diffusion model, that is outperformed by the Pix-to-Act method even though Pix-to-Act uses no pre-training.
Inertial Measurement Unit, a device used to measure the orientation and rotation of a vehicle or manipulator, which can be incorporated into equivariant models.
A cyclic group with four elements, representing 90-degree rotations, used as an example of a finite subgroup for enforcing equivariance.
Mathematical objects used as basis functions for group representations, specifically in the context of SO(3) convolutions in Fourier space for the image-to-sphere method.
A victory that comes at such a great cost that it is tantamount to defeat, used as an analogy for relying solely on large amounts of data without model intelligence.