Stanford Robotics Seminar ENGR319 | Spring 2026 | Leveraging Geometry in Robot Learning

Stanford Online
Education | 6 min read | 64 min video
Jun 4, 2026 | 723 views | 29 | 1

TL;DR

Researchers are developing robot learning models with built-in geometric priors that can be up to 10x more data-efficient and generalize better than current large-scale models, potentially shifting the scaling curve to the left.

Key Insights

1. Traditional model-based robotics relied on hand-coded geometric models whose assumptions often did not match reality, causing failures in tasks as basic as picking up an object.

2. Generalist models such as VLAs learn from data, but they require huge datasets and lose explicit geometric structure during self-attention, so they have to relearn geometry from the data.

3. Equivariant diffusion policy, which encodes the scene as a point cloud and reasons with finite subgroups of SE(3), was about 10x more data-efficient than a baseline diffusion policy on the MimicGen benchmark: it performed better with 100 demonstrations than the baseline did with 1000.

4. The image-to-sphere method, which embeds images on a sphere and applies Fourier transforms based on spherical harmonics, was roughly twice as data-efficient as the baselines on MimicGen benchmarks, with the largest benefit on tasks that require precision.

5. Raven, a method that embeds images as 3D rays and processes them with geometric transformer attention, provides a conceptually consistent mechanism for combining multiple views and modalities, though questions remain about explicit equivariance and camera calibration.

6. The Pix-to-Act approach uses multi-view transformers and domain randomization via camera rotation to infer image-plane trajectories that are then triangulated into 3D. On multitask scenarios it outperformed pre-trained large baseline models (such as LDM) without any pre-training of its own.

The limitations of past and present approaches in robotics

For the past three decades, robotics relied predominantly on hand-coded geometric models for model-based planning. These methods were powerful, but they made strong assumptions about the environment that often proved unrealistic, leading to critical failures such as robots misestimating object locations. Current generalist models, such as vision-language-action models (VLAs), instead learn directly from data, which overcomes some of these limitations. However, they need massive training datasets and tend to lose explicit geometric structure as they process their inputs, so much of that data is spent relearning spatial information. The result is a dichotomy: at one extreme, rigid, assumption-laden models; at the other, data-hungry generalist models with no built-in notion of geometry.

Introducing geometric structure into machine learning models

The central question is whether a middle ground exists: machine learning models that effectively incorporate geometric or physical priors. The research presented proposes ways to encode observations so that they retain geometric structure, with the aim of improving policy learning. This involves new representations of the world, such as point clouds, embeddings on a sphere, 3D rays, and stereo images. These representations are then processed with machinery designed for geometric reasoning, including finite subgroups of SE(3), Fourier coefficients, and geometric transformer attention. The goal is models that are smarter by design, exploiting the symmetries and structure of the physical world rather than learning them from scratch through sheer data volume.
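To make "finite subgroups" concrete, here is a minimal sketch of one of the simplest finite rotation groups: the cyclic group C_n of rotations about a single axis. The helper names (`rot_z`, `cyclic_group`, `is_closed`) and the choice of the z-axis are illustrative assumptions, not something specified in the talk.

```python
import numpy as np


def rot_z(theta):
    """Rotation matrix for an angle theta (radians) about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])


def cyclic_group(n):
    """The n rotations of the cyclic group C_n about z (hypothetical helper)."""
    return [rot_z(2.0 * np.pi * k / n) for k in range(n)]


def is_closed(group, tol=1e-9):
    """Check that the product of any two elements is again in the group."""
    for a in group:
        for b in group:
            prod = a @ b
            if not any(np.allclose(prod, g, atol=tol) for g in group):
                return False
    return True


# Four rotations by multiples of 90 degrees form a finite subgroup of SO(3).
C4 = cyclic_group(4)
print(len(C4), is_closed(C4))
```

Because the group has only finitely many elements, a network can enumerate them explicitly, which is what makes discretized equivariant layers practical.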

Equivariant diffusion policy: 10x data efficiency through symmetry

The first method, 'equivariant diffusion policy,' encodes the world as a point cloud and reasons over it using finite groups. The motivation comes from Emmy Noether's theorems, which tie symmetries such as translation and rotation to physical conservation laws; the approach builds those symmetries directly into the model. The argument is that when the transition dynamics are invariant to translation and rotation, the optimal policy is expected to be equivariant: transforming the observation transforms the optimal action in the same way. The model therefore uses equivariant neural network layers, designed so that a transformation of the input (such as a rotation) automatically produces the corresponding transformation of the output flow field or action. On the MimicGen tasks, this method was about 10x more data-efficient than a standard diffusion policy, performing better with 100 demonstrations than the baseline did with 1000. It excelled particularly on tasks that demand high equivariance, and it generalized better over pose.
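The equivariance property described above, that rotating the input rotates the output in the same way, can be checked numerically. The sketch below does not reproduce the equivariant layers from the talk; it swaps in plain group averaging (symmetrization), a generic way to make any function equivariant to a finite group. The toy policy, its weights, and the choice of C_8 about the z-axis are all assumptions made for illustration.

```python
import numpy as np


def rot_z(theta):
    """Rotation matrix for an angle theta (radians) about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])


def c_n(n):
    """The cyclic group C_n as a list of rotation matrices (hypothetical helper)."""
    return [rot_z(2.0 * np.pi * k / n) for k in range(n)]


def raw_policy(obs, W1, W2):
    """Unconstrained toy policy: 3-D observation -> 3-D action."""
    return W2 @ np.tanh(W1 @ obs)


def symmetrized_policy(obs, W1, W2, group):
    """Average g^{-1} f(g x) over the group; the result is equivariant."""
    out = np.zeros(3)
    for g in group:
        # g.T is the inverse of g because g is a rotation matrix.
        out += g.T @ raw_policy(g @ obs, W1, W2)
    return out / len(group)


rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 3))
W2 = rng.normal(size=(3, 16))
group = c_n(8)
obs = rng.normal(size=3)

# Equivariance: policy(g @ obs) should equal g @ policy(obs) for every g.
max_err = max(
    np.abs(symmetrized_policy(g @ obs, W1, W2, group)
           - g @ symmetrized_policy(obs, W1, W2, group)).max()
    for g in group
)
print(max_err)
```

Averaging costs one forward pass per group element, which is one reason purpose-built equivariant layers are usually preferred in practice; the sketch only demonstrates the property itself.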

Image-to-sphere: Leveraging RGB input with geometric priors

The second approach, 'image-to-sphere,' tackles direct RGB input by embedding images onto the 2D sphere. An RGB image is encoded and its features are projected onto the sphere, where they can be acted on by SO(3) rotations and expanded in spherical harmonics, the Fourier basis on the sphere. This enables convolutions over both the sphere and SO(3). The processed features are then mapped back to a discrete subgroup of SO(3) and used within an equivariant diffusion framework. Although it does not match the point cloud method, it is roughly twice as data-efficient as the baselines on MimicGen benchmarks. It matters most for tasks requiring high precision, such as manipulating small objects, where point clouds may lack resolution. Performance also improved with pre-trained image encoders, reaching 72% success.
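As a rough illustration of the projection step, the sketch below maps every pixel of a pinhole camera to a unit direction on the sphere and projects a scalar feature map onto the degree-1 spherical harmonic band, whose real basis functions are proportional to x, y and z. The pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the function names, and the unweighted sum over samples are assumptions; the actual method uses learned features and higher-degree harmonics.

```python
import numpy as np


def pixels_to_sphere(height, width, fx, fy, cx, cy):
    """Unit ray directions on the sphere for every pixel of a pinhole camera."""
    v, u = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    dirs = np.stack(
        [(u - cx) / fx, (v - cy) / fy, np.ones_like(u, dtype=float)],
        axis=-1,
    )
    return dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)


def degree1_coeffs(features, dirs):
    """Discrete projection of a scalar feature map onto the degree-1 band.

    The real degree-1 harmonics are proportional to (x, y, z), so the
    coefficients are the feature-weighted sum of the directions
    (up to ordering and a constant factor).
    """
    return (features[..., None] * dirs).sum(axis=(0, 1))


# Hypothetical 4x6 feature map and camera intrinsics.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 6))
dirs = pixels_to_sphere(4, 6, fx=100.0, fy=100.0, cx=2.5, cy=1.5)

# Rotating the signal on the sphere rotates its degree-1 coefficients.
t = 0.7
R = np.array([[np.cos(t), -np.sin(t), 0.0],
              [np.sin(t), np.cos(t), 0.0],
              [0.0, 0.0, 1.0]])
rotated = degree1_coeffs(features, dirs @ R.T)
expected = R @ degree1_coeffs(features, dirs)
print(np.abs(rotated - expected).max())
```

The point of working in this basis is that a rotation acts on the coefficients by a fixed linear map, so rotational structure is preserved through the network rather than relearned.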

Raven: Geometric transformer attention for multi-view reasoning

The 'Raven' method embeds images as 3D rays: each ray runs from the camera to an image patch and is oriented with respect to the patch's texture. This gives a more intuitive embedding of images into the 3D world. To reason about these rays, Raven uses 'geometric transformer attention.' The technique transforms queries, keys, and values into a common reference frame before the attention operation and transforms the result back afterwards, so reference frames are built into the attention mechanism without fundamentally changing how the transformer learns. The performance gains are slightly lower than those of the previous approaches, but the method's strength is a conceptually consistent mechanism for combining multiple views and modalities. It handles different data types in a unified way by attaching them to a common reference frame, though it requires camera calibration and careful design of coordinate frames.
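The "transform to a common frame, attend, transform back" pattern can be sketched as follows. This is a rotation-only toy version under several assumptions: each token carries a known rotation `frames[t]`, features are stacks of 3-vectors expressed in the token's local frame, and the function names are hypothetical. It is not the Raven implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def random_rotation(rng):
    """Random 3x3 rotation from a QR decomposition (hypothetical helper)."""
    q_mat, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q_mat) < 0:
        q_mat[:, 0] *= -1.0
    return q_mat


def frame_attention(q, k, v, frames):
    """Single-head attention computed in a shared world frame.

    q, k, v: (tokens, channels, 3), each channel a 3-vector in the
             token's local frame.
    frames:  (tokens, 3, 3), rotation from each local frame to the world.
    """
    # Local -> world: rotate every channel by its token's frame.
    qw = np.einsum("tij,tcj->tci", frames, q)
    kw = np.einsum("tij,tcj->tci", frames, k)
    vw = np.einsum("tij,tcj->tci", frames, v)
    channels = q.shape[1]
    scores = np.einsum("tci,sci->ts", qw, kw) / np.sqrt(3 * channels)
    out_world = np.einsum("ts,sci->tci", softmax(scores, axis=-1), vw)
    # World -> local frame of each query token (inverse = transpose).
    return np.einsum("tji,tcj->tci", frames, out_world)


rng = np.random.default_rng(0)
T, C = 5, 4
q, k, v = (rng.normal(size=(T, C, 3)) for _ in range(3))
frames = np.stack([random_rotation(rng) for _ in range(T)])

# Re-expressing every frame in a different world frame leaves the output unchanged.
G = random_rotation(rng)
out_a = frame_attention(q, k, v, frames)
out_b = frame_attention(q, k, v, np.einsum("ij,tjk->tik", G, frames))
print(np.abs(out_a - out_b).max())
```

The check at the end shows why the construction is attractive: the result depends only on the relative geometry between tokens, not on where the world origin happens to be. It also makes the calibration requirement visible, since every token needs a known frame.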

Pix-to-Act: Multi-view transformers and data augmentation for precise control

The 'Pix-to-Act' approach uses two cameras mounted on the robot's end-effector and infers trajectories for key points within each camera's image plane. These planar trajectories are then triangulated into 3D space. The model is a multi-view transformer that performs self-attention within each image and cross-attention between images, which lets it integrate context from both views. A key innovation is a data augmentation technique: randomly rotating the cameras around their visual axes, which forces the model to rely on local image structure when inferring trajectories. This significantly improves generalization over viewpoint, enabling the model to outperform pre-trained large baseline models (like LDM) on multitask scenarios without any pre-training, and to achieve high success rates on complex tasks such as coffee making.
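The triangulation step can be illustrated with standard linear (DLT) triangulation from two calibrated views. The camera intrinsics, the 10 cm baseline, and the waypoints below are made-up values for the sketch, and the summary does not say which triangulation method the talk actually uses.

```python
import numpy as np


def project(P, X):
    """Project a 3-D point X with a 3x4 camera matrix P to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


def triangulate(P1, P2, uv1, uv2):
    """Linear (DLT) triangulation of one point seen in two calibrated views."""
    # Each view contributes two linear constraints on the homogeneous point.
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]


# Hypothetical stereo pair: shared intrinsics, 10 cm horizontal baseline.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])

# A short 3-D key-point trajectory, observed as 2-D tracks in both images.
waypoints = np.array([[0.0, 0.0, 0.5],
                      [0.05, 0.02, 0.6],
                      [0.1, -0.03, 0.7]])
recovered = np.array([
    triangulate(P1, P2, project(P1, X), project(P2, X)) for X in waypoints
])
print(np.abs(recovered - waypoints).max())
```

With noise-free pixel tracks the 3D points are recovered up to floating-point error; with predicted tracks, the accuracy depends on the baseline between the cameras and on calibration quality.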

The impact of geometry on scaling laws and future directions

The overarching hope is that incorporating geometric priors and symmetries shifts the scaling curves to the left, so that a model reaches the same performance with less data. Instead of relying solely on massive datasets, smarter models can exploit structural knowledge. Early evidence from workshop papers and experiments suggests that equivariant models can indeed shift these power-law scaling relationships, making each data point count for more. The research acknowledges limitations, such as its focus on translation and rotation as the primary symmetries and the need for careful integration of coordinate frames. Future work may explore more complex symmetries, apply these techniques to other data modalities such as tactile sensing, and further validate the shift in scaling laws. The ultimate goal is robots that learn more efficiently by incorporating fundamental knowledge of the physical world.
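What "shifting the curve left" means for a power law can be made explicit with a short sketch. It assumes error follows a * n^(-b) in the number of demonstrations n, and that both models share the exponent b. All numbers are synthetic, chosen so that the shift equals the 10x factor quoted earlier (100 versus 1000 demonstrations); they are not measurements from the talk.

```python
import numpy as np


def power_law(n, a, b):
    """Error as a function of dataset size n: a * n^(-b)."""
    return a * n ** (-b)


def fit_power_law(n, err):
    """Least-squares fit of log(err) = log(a) - b * log(n)."""
    slope, intercept = np.polyfit(np.log(n), np.log(err), 1)
    return np.exp(intercept), -slope


def data_efficiency_gain(a_base, a_new, b):
    """Factor k with new(n) == base(k * n), for curves sharing exponent b."""
    return (a_base / a_new) ** (1.0 / b)


# Synthetic curves (assumed values): same exponent, smaller prefactor.
n = np.array([100.0, 300.0, 1000.0, 3000.0])
a_base, b = 1.0, 0.5
a_equi = a_base * 10.0 ** (-b)

a_fit_base, b_fit_base = fit_power_law(n, power_law(n, a_base, b))
a_fit_equi, b_fit_equi = fit_power_law(n, power_law(n, a_equi, b))
gain = data_efficiency_gain(a_fit_base, a_fit_equi, b_fit_base)
print(gain)
```

On a log-log plot the two curves are parallel lines, and the gain is the horizontal distance between them. A change in the exponent b instead would make the advantage grow or shrink with dataset size.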

Equivariant Diffusion Policy: Data Efficiency Comparison

Data extracted from this episode

Model | Demonstrations | Performance
Equivariant Diffusion Policy (Point Cloud) | 100 | Better than baseline Diffusion Policy with 1000 demos
Baseline Diffusion Policy | 1000 | Inferior to Equivariant Diffusion Policy with 100 demos

Image-to-Sphere: Data Efficiency Comparison

Model | Demonstrations | Performance
Image-to-Sphere Method | 100 | Outperforms baseline by 2x in data efficiency
Baseline Models | 200 | Inferior to Image-to-Sphere with 100 demos

Tasks Solved by Equivariant Models within 100 Demonstrations

Model | Task | Demonstrations Required
Equivariant Diffusion Policy | Physical Tasks (various) | <100 (except coffee making at 160)
Image-to-Sphere | Physical Tasks (various) | ~60-70

Common Questions

Traditional robotics models rely on hand-coded geometric models that make strong assumptions about the environment, which leads to failures when those assumptions don't match reality. Generalist models, such as VLAs, learn directly from data, which overcomes some of these limitations but requires vast amounts of training data.
