How can geometry be incorporated into machine learning models for robotics?

The speaker proposes methods that encode observations to retain geometric information, use representations like point clouds or embeddings on a sphere, and employ techniques like finite subgroups of SE(3) or geometric transformer attention to reason about these representations.

What is the key advantage of equivariant neural networks in robotics?

Equivariant networks encode physical symmetries, such as translation and rotation invariance, directly into the model. This leads to better data efficiency and generalization over pose, as the model doesn't need to learn these fundamental properties from scratch.

How does the Equivariant Diffusion Policy improve learning efficiency?

By using equivariant layers, this policy leverages geometric structure and symmetries. It achieved a 10x improvement in data efficiency with point clouds, performing better with 100 demonstrations than a standard diffusion policy with 1000.

What are the limitations of the Equivariant Diffusion Policy?

The model is built to generalize over finite groups, not continuous ones, so finer rotations still require data augmentation. Larger discrete groups can also be computationally expensive, and point clouds are generally sparser than images.

How does the Image-to-Sphere method handle RGB input for robotics?

This method embeds RGB images onto a sphere, allowing for SO(3) rotations in Fourier space using spherical harmonics. It's particularly beneficial for tasks requiring high precision, like manipulating small objects.

What is the 'Raven' method and its advantage?

Raven embeds images as 3D rays and uses geometric transformer attention. Its main advantage is providing an intellectually consistent mechanism for accommodating multiple views or modalities (like pixels and force data) by attaching them to a common reference frame.

How does the Pix2Act method improve generalization from images?

Pix2Act uses a novel data augmentation technique involving rotating cameras on their visual axes. This forces the model to focus on local image structure, leading to strong generalization over viewpoints without explicit equivariance layers.
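
As a rough illustration of the augmentation idea, the sketch below rotates a labeled pixel-plane trajectory about the image center by a random roll angle. It assumes (this is not stated in the talk summary) that the visual axis passes through the image center and that the same rotation would also be applied to the image pixels, which is omitted here. The names `rotate_about_center` and `augment_roll` are hypothetical.

```python
import math
import random

def rotate_about_center(points, theta, center):
    """Rotate 2D pixel coordinates by theta radians about the given center."""
    cx, cy = center
    c, s = math.cos(theta), math.sin(theta)
    return [(cx + c * (u - cx) - s * (v - cy),
             cy + s * (u - cx) + c * (v - cy)) for u, v in points]

def augment_roll(trajectory, center, rng):
    """Sample a random roll angle and rotate the labeled trajectory by it.

    In a full pipeline the image itself would be rotated by the same angle
    (an assumption of this sketch; the image step is not shown).
    """
    theta = rng.uniform(-math.pi, math.pi)
    return theta, rotate_about_center(trajectory, theta, center)

center = (64.0, 64.0)  # hypothetical principal point of a 128x128 image
trajectory = [(70.0, 60.0), (75.0, 58.0), (82.0, 55.0)]
rng = random.Random(0)

theta, rolled = augment_roll(trajectory, center, rng)
restored = rotate_about_center(rolled, -theta, center)  # undo the roll
```

Because a roll preserves each point's distance from the center and the relative layout of nearby points, only the global orientation changes, which is one way to read the claim that the model is pushed toward local image structure.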

What is the 'scaling law' in machine learning and how can it be improved?

Scaling laws describe the power-law relationship between data size, model size, and performance. Using 'smarter' models with built-in structure, like equivariance, can shift this scaling curve to the left, meaning data counts for more and better performance is achieved with less data.

Can these geometric approaches be applied to tactile data?

Yes, the speaker states these ideas apply without modification to force data. For tactile data, it's similar to image data if captured with sensors like GelSight, though managing multiple coordinate systems for multi-fingered hands would be necessary.

How does incorporating structure into robot learning models affect generality?

While structure can be seen as a limitation, the speaker argues that incorporating biases through concepts like equivariance doesn't preclude solutions found in the data. It acts as a helpful prior, improving data efficiency without sacrificing the ability to adapt.

Key Moments

Stanford Robotics Seminar ENGR319 | Spring 2026 | Leveraging Geometry in Robot Learning

Q: What is the main difference between traditional robotics models and modern generalist models?

Traditional robotics models rely on hand-coded geometric models that make strong assumptions about the environment, leading to failures when these assumptions don't match reality. Generalist models, such as vision-language-action (VLA) models, learn directly from data, overcoming some limitations but requiring vast amounts of training data.

Stanford Online

Education | 6 min read | 64 min video

Jun 4, 2026


TL;DR

Researchers are developing robot learning models that incorporate geometric priors to train 10x more data-efficiently and generalize better than current large-scale models, potentially shifting the scaling curve left.

Key Insights

Traditional model-based robotics relied on hand-coded geometric models whose assumptions often did not reflect reality, leading to failures in tasks such as object pickup.

Generalist models such as VLAs learn from vast amounts of data, but they require huge datasets and lose explicit geometric understanding during self-attention, so geometry has to be relearned from data.

Equivariant diffusion policy, which encodes point clouds and reasons with finite subgroups of SE(3), demonstrated a 10x improvement in data efficiency on the MimicGen benchmark: it performed better with 100 demonstrations than the baseline diffusion policy did with 1000.

The image-to-sphere method, which embeds images on a sphere and uses Fourier transforms with spherical harmonics, achieved a twofold improvement in data efficiency on MimicGen benchmarks compared to baselines, especially benefiting tasks requiring precision.

Raven, a method using geometric transformer attention by embedding images as 3D rays, provides an intellectually consistent mechanism for combining multiple views and modalities, though questions remain about explicit equivariance and camera calibration.

The Pix-to-Act approach, using multi-view transformers and domain randomization via camera rotation, infers pixel-plane trajectories that are triangulated into 3D. Without any pre-training of its own, it outperforms pre-trained LDM baselines on multitask scenarios.

The limitations of past and present approaches in robotics

For the past three decades, robotics predominantly relied on hand-coded geometric models for model-based planning. While powerful, these methods made strong assumptions about the environment that often proved unrealistic, leading to critical failures, such as robots misestimating object locations. In contrast, current generalist models, like vision-language-action (VLA) models, learn directly from data, overcoming some of these limitations. However, they require massive datasets for training and tend to lose explicit geometric understanding during their reasoning processes, necessitating extensive data to relearn this spatial information. This creates a dichotomy: on one extreme are rigid, assumption-laden models, and on the other are data-hungry, geometrically disembodied generalist models.

Introducing geometric structure into machine learning models

The central question explored is whether a middle ground exists—machine learning models that effectively incorporate geometric or physical priors. The research presented proposes methods to encode observations in ways that retain geometric structure, aiming to improve policy learning. This involves developing novel representations of the world, such as point clouds, embeddings on a sphere, 3D rays, and stereo images. These geometric representations are then processed using frameworks designed to integrate geometric reasoning, including finite subgroups of SE(3), Fourier coefficients, and geometric transformer attention. The goal is to create models that are smarter by design, leveraging inherent symmetries and structures of the physical world rather than learning them from scratch through sheer data volume.

Equivariant diffusion policy: 10x data efficiency through symmetry

The first presented method, 'equivariant diffusion policy,' encodes the world as a point cloud and reasons using finite groups. The approach is inspired by Emmy Noether's theorems, which tie symmetries such as translation and rotation to physical conservation laws, and it embeds those symmetries directly into the model. For example, when the transition dynamics are invariant to translation and rotation, the optimal policy is expected to be equivariant. The model uses equivariant neural network layers, designed so that a transformation of the input (such as a rotation) automatically produces the corresponding transformation of the output flow field or action. Benchmarked on the MimicGen tasks, this method demonstrated a 10x improvement in data efficiency compared to a standard diffusion policy, achieving better performance with 100 demonstrations than the baseline with 1000. It excelled particularly in tasks requiring high equivariance and generalized better over pose.
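
To make the equivariance property concrete, here is a minimal sketch for the four-element rotation group C4 acting on 2D vectors. It does not use the equivariant layers from the talk; instead it symmetrizes an arbitrary function by averaging over the group, which is a simpler, well-known way to obtain an equivariant map. The functions `raw_policy` and `c4_equivariant_policy` are hypothetical stand-ins for a learned policy.

```python
import math

def rot90(v, k=1):
    """Rotate a 2D vector counter-clockwise by k * 90 degrees (an element of C4)."""
    x, y = v
    for _ in range(k % 4):
        x, y = -y, x
    return (x, y)

def raw_policy(obs):
    """An arbitrary, non-equivariant map from a 2D observation to a 2D action."""
    x, y = obs
    return (math.tanh(2.0 * x + 0.5 * y) + 0.3, x * x - 0.7 * y)

def c4_equivariant_policy(obs):
    """Symmetrize raw_policy over C4 by averaging g^{-1} f(g x) over all four rotations."""
    ax, ay = 0.0, 0.0
    for k in range(4):
        a = rot90(raw_policy(rot90(obs, k)), -k)
        ax += a[0] / 4.0
        ay += a[1] / 4.0
    return (ax, ay)

obs = (0.4, -1.2)

# Equivariance: acting on a rotated observation equals rotating the action.
lhs = c4_equivariant_policy(rot90(obs))
rhs = rot90(c4_equivariant_policy(obs))

# The unsymmetrized function does not have this property.
raw_lhs = raw_policy(rot90(obs))
raw_rhs = rot90(raw_policy(obs))
```

The symmetrized policy satisfies the constraint for every input by construction, so a demonstration seen in one orientation constrains the policy in the three rotated orientations as well.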

Image-to-sphere: Leveraging RGB input with geometric priors

The second approach, 'image-to-sphere,' addresses the challenge of using direct RGB input by embedding images onto a 2D sphere. An RGB image is encoded, its features projected onto the sphere, and then manipulated using SO(3) rotations and spherical harmonics for Fourier transforms. This allows for convolutions in both spherical and SO(3) spaces. The processed features are then brought back to a discrete subgroup of SO(3) and used within an equivariant diffusion framework. While not achieving the same performance as the point cloud method, it significantly outperforms baselines by a factor of two in data efficiency on MimicGen benchmarks, proving crucial for tasks requiring high precision, such as manipulating small objects, where point clouds might lack resolution. It also showed improved performance with pre-trained image encoders, reaching 72% success.
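
The idea of rotating a spherical signal by acting on its Fourier coefficients can be illustrated in the lowest nontrivial case. The sketch below assumes a degree-1 signal f(x) = c . x on the unit sphere, written in the Cartesian basis x, y, z (standard spherical-harmonic orderings differ from this by a permutation and scaling). In this basis, rotating the signal is the same as multiplying its three coefficients by the rotation matrix. This is a toy example, not the image-to-sphere pipeline, and all names are hypothetical.

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

def transpose(A):
    return [list(row) for row in zip(*A)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rot_z(t):
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_y(t):
    c, s = math.cos(t), math.sin(t)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def signal(coeffs, x):
    """Degree-1 signal on the sphere: f(x) = coeffs . x."""
    return dot(coeffs, x)

R = matmul(rot_z(0.7), rot_y(-0.4))   # an arbitrary SO(3) rotation
coeffs = [0.5, -1.0, 2.0]             # hypothetical degree-1 coefficients

raw = [0.3, 0.8, -0.5]
norm = math.sqrt(dot(raw, raw))
x = [v / norm for v in raw]           # a point on the unit sphere

# Rotate the signal in the spatial domain: (R f)(x) = f(R^{-1} x).
spatial = signal(coeffs, matvec(transpose(R), x))

# Rotate the signal in the coefficient domain: c -> R c.
spectral = signal(matvec(R, coeffs), x)
```

Higher degrees work the same way, with the 3x3 rotation matrix replaced by the larger Wigner D-matrix for that degree.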

Raven: Geometric transformer attention for multi-view reasoning

The 'Raven' method embeds images as 3D rays, where each ray from the camera to an image patch is oriented with respect to the patch's texture. This allows for a more intuitive embedding of images into the 3D world. To reason about these rays, Raven employs 'geometric transformer attention.' This technique transforms queries, keys, and values into a common reference frame before the attention operation, and then back, essentially incorporating reference frames into the attention mechanism without fundamentally altering the transformer's learning process. While this method shows slightly lower performance gains than previous approaches, its strength lies in its conceptually consistent mechanism for combining multiple views and modalities. It provides a unified way to handle different data types by attaching them to a common reference frame, though it requires camera calibration and careful design of coordinate frames.
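
The frame-handling step can be sketched as follows. This is a simplified illustration, not Raven's implementation: it assumes each token carries a rotation-only reference frame and that queries, keys, and values are single 3D vectors expressed in the token's own frame. The function `geometric_attention` and the demo values are hypothetical.

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

def transpose(A):
    return [list(row) for row in zip(*A)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rot_z(t):
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_y(t):
    c, s = math.cos(t), math.sin(t)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def geometric_attention(frames, q, k, v):
    """Attention over tokens whose features live in per-token frames."""
    # 1. Express every query, key, and value in the shared world frame.
    qw = [matvec(R, x) for R, x in zip(frames, q)]
    kw = [matvec(R, x) for R, x in zip(frames, k)]
    vw = [matvec(R, x) for R, x in zip(frames, v)]
    out = []
    for i, R in enumerate(frames):
        # 2. Ordinary softmax attention in the common frame.
        scores = [dot(qw[i], kj) for kj in kw]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        mixed = [sum(w * vj[d] for w, vj in zip(weights, vw)) / total for d in range(3)]
        # 3. Map the result back into token i's own frame.
        out.append(matvec(transpose(R), mixed))
    return out

frames = [rot_z(0.2), rot_y(1.0), matmul(rot_z(-0.5), rot_y(0.3))]
q = [[0.1, 0.4, -0.2], [0.7, -0.3, 0.5], [-0.6, 0.2, 0.9]]
k = [[0.3, -0.8, 0.1], [0.2, 0.6, -0.4], [0.5, 0.5, 0.0]]
v = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]

out_a = geometric_attention(frames, q, k, v)

# Re-express all frames in a different world frame G; local outputs should not change.
G = matmul(rot_y(0.9), rot_z(2.1))
out_b = geometric_attention([matmul(G, R) for R in frames], q, k, v)
```

Because only relative frame information enters the scores and outputs, the choice of world frame does not matter, and tokens from different cameras or sensors can be mixed as long as each one carries a frame.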

Pix-to-Act: Multi-view transformers and data augmentation for precise control

The 'Pix-to-Act' approach utilizes two cameras mounted on a robot's end-effector, inferring trajectories for key points within each camera's image plane. These planar trajectories are then triangulated into 3D space. The model employs a multi-view transformer that performs self-attention within each image and cross-attention between images, allowing it to integrate context from different views. A key innovation here is a novel data augmentation technique—randomly rotating the cameras around their visual axes—which forces the model to focus on local image structure for inferring trajectories. This significantly improves generalization over viewpoint, enabling the model to outperform pre-trained large baseline models (like LDM) on multitask scenarios without any pre-training, and achieve high success rates in complex tasks like coffee making.
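
The triangulation step can be illustrated with a small two-camera example. The sketch below assumes ideal pinhole cameras with known poses and a shared focal length, and it uses midpoint triangulation (the point halfway between the closest points on the two pixel rays). The summary does not say which triangulation method the speaker's lab uses, so this is a stand-in, and every name and number here is hypothetical.

```python
import math

def rot_y(t):
    c, s = math.cos(t), math.sin(t)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

def transpose(A):
    return [list(row) for row in zip(*A)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def project(X, center, R, f):
    """Pinhole projection of world point X (R maps camera axes to world axes)."""
    x, y, z = matvec(transpose(R), sub(X, center))
    return (f * x / z, f * y / z)

def pixel_ray(pixel, R, f):
    """World-frame direction of the ray through a pixel."""
    u, v = pixel
    return matvec(R, [u / f, v / f, 1.0])

def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between rays c1 + s*d1 and c2 + t*d2."""
    w = sub(c1, c2)
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b          # zero only if the rays are parallel
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [c1[i] + s * d1[i] for i in range(3)]
    p2 = [c2[i] + t * d2[i] for i in range(3)]
    return [(p1[i] + p2[i]) / 2.0 for i in range(3)]

F = 500.0                                   # focal length in pixels
C1, R1 = [-0.05, 0.0, 0.0], rot_y(0.1)      # left camera, toed in
C2, R2 = [0.05, 0.0, 0.0], rot_y(-0.1)      # right camera, toed in

trajectory_3d = [[0.00, 0.02, 0.30], [0.01, 0.03, 0.35], [0.02, 0.05, 0.40]]

# Per-camera planar trajectories, standing in for the model's predictions.
track1 = [project(X, C1, R1, F) for X in trajectory_3d]
track2 = [project(X, C2, R2, F) for X in trajectory_3d]

recovered = [
    triangulate_midpoint(C1, pixel_ray(p1, R1, F), C2, pixel_ray(p2, R2, F))
    for p1, p2 in zip(track1, track2)
]
```

With noise-free pixel tracks the two rays intersect and the original 3D trajectory is recovered; with predicted tracks, the midpoint gives a compromise between the two views.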

The impact of geometry on scaling laws and future directions

The overarching hope is that by incorporating geometric priors and symmetries, these models can shift the scaling curves to the left, meaning they become more data-efficient. Instead of solely relying on massive datasets, smarter models can leverage structural knowledge. Early evidence from workshop papers and experiments suggests that equivariant models can indeed shift these power-law scaling relationships, making data count for more. The research acknowledges limitations, such as focusing primarily on translation and rotation as symmetries, and the need for careful integration of coordinate frames. Future work may involve exploring more complex symmetries, applying these techniques across diverse data modalities like tactile sensing, and further validating the shift in scaling laws. The ultimate goal is to move towards robots that learn more efficiently by incorporating fundamental knowledge of the physical world.
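
The phrase "shifting the scaling curve to the left" can be made concrete with a toy power law. The sketch below assumes error falls as a * N^(-b) in the number of demonstrations N, and models a more data-efficient method as one whose demonstrations each count `efficiency` times over. The constants `a` and `b` are hypothetical and are not taken from the talk.

```python
def power_law_error(n_demos, a=1.0, b=0.5, efficiency=1.0):
    """Toy scaling law: error = a * (efficiency * N)^(-b)."""
    return a * (efficiency * n_demos) ** (-b)

def demos_needed(target_error, a=1.0, b=0.5, efficiency=1.0):
    """Invert the toy scaling law to get the demonstrations needed for a target error."""
    return (a / target_error) ** (1.0 / b) / efficiency

# A 10x more data-efficient model at 100 demos sits where the baseline is at 1000.
baseline_at_1000 = power_law_error(1000)
structured_at_100 = power_law_error(100, efficiency=10.0)

# The same target error is reached with one tenth of the data.
baseline_needed = demos_needed(0.05)
structured_needed = demos_needed(0.05, efficiency=10.0)
```

On a log-log plot the two curves have the same slope, and the structured model's curve is the baseline's curve moved left by log10(efficiency), which is one decade for a 10x gain.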


Equivariant Diffusion Policy: Data Efficiency Comparison

Data extracted from this episode

Model	Demonstrations	Performance
Equivariant Diffusion Policy (Point Cloud)	100	Better than baseline Diff Policy with 1000 demos
Baseline Diffusion Policy	1000	Inferior to Equivariant Diffusion Policy with 100 demos

Image-to-Sphere: Data Efficiency Comparison

Data extracted from this episode

Model	Demonstrations	Performance
Image-to-Sphere Method	100	Outperforms baseline by 2x in data efficiency
Baseline Models	200	Inferior to Image-to-Sphere with 100 demos

Tasks Solved by Equivariant Models within 100 Demonstrations

Data extracted from this episode

Model	Task	Demonstrations Required
Equivariant Diffusion Policy	Physical Tasks (various)	<100 (except coffee making at 160)
Image-to-Sphere	Physical Tasks (various)	~60-70


Topics

AI & Machine Learning Technology & Innovation Science & Mathematics Neural Networks Machine Learning Computer Vision Policy Learning Geometric Deep Learning

Mentioned in this video

People

Rob Platt

The speaker and presenter at the Stanford Robotics Seminar.

Emmy Noether

A mathematician known for Noether's theorem, which establishes a correspondence between conservation laws in physics and symmetries in the real world.

Studies & Research

Yodo

A classic example of a model-based robotics paper from 2022 that won an RSS best paper award, demonstrating 'You Only Demonstrate Once' learning by estimating object locations from CAD models.

MimicGen

A benchmark dataset for robotic manipulation tasks, comprising 12 different tasks used to evaluate various policy learning methods.

Software & Apps

XVLA

A large vision-language-action model mentioned as an example of current generalist models in robotics.

CLIP

A visual encoder used in generalist VLA models, which converts visual input into embeddings.

ResNet

A type of neural network architecture, often used as a pre-trained encoder for image processing in robotics models.

Diffusion Policy

A baseline model for learning robot control policies, serving as a comparison point for the proposed equivariant methods.

ACT

A robotics policy learning model mentioned as a baseline in performance comparisons.

Pix2Act

A recent method developed by the speaker's lab that uses multi-view transformers and a novel data augmentation technique for inferring 3D trajectories from 2D image planes.

LDM

A pre-trained large baseline model (the acronym is not expanded in this summary) that the Pix2Act method outperforms on multitask scenarios without any pre-training of its own.

IMU

Inertial Measurement Unit, a device used to measure the orientation and rotation of a vehicle or manipulator, which can be incorporated into equivariant models.

Organizations

TRRI

The organization behind a specific vision-language-action (VLA) model discussed, exemplified by their model from the previous year.

Companies

Manifold

A generalist model discussed, similar in structure to other VLA models, featuring a visual encoder and diffusion transformer.

Concepts

C4

A cyclic group with four elements, representing 90-degree rotations, used as an example of a finite subgroup for enforcing equivariance.

Wigner D-matrices

The matrix elements of the irreducible representations of SO(3), used as basis functions for SO(3) convolutions in Fourier space in the image-to-sphere method.

Pyrrhic victory

A victory that comes at such a great cost that it is tantamount to defeat, used as an analogy for relying solely on large amounts of data without model intelligence.

Products

GelSight

A tactile sensing technology that can capture high-resolution images of surfaces, which can be used with the proposed methods for tactile data.
