Stanford Robotics Seminar ENGR319 | Winter 2026 | Gen Control, Action Chunking, Moravec’s Paradox
Key Moments
Robots struggle with physical tasks due to compounding errors, but action chunking and generative policies offer solutions by improving stability and learning inductive biases, making robotics more scalable.
Key Insights
Historically, robots have found physical tasks like motion and manipulation harder than symbolic reasoning tasks, a phenomenon known as Moravec's Paradox.
Behavior cloning, while effective as supervised learning, suffers from compounding error in continuous control settings, leading to instability even when the expert policy and system dynamics are stable.
Action chunking, the practice of predicting multiple actions in sequence and executing them open-loop, significantly mitigates compounding error by ensuring closed-loop stability for sufficiently large chunks.
Generative control policies, particularly flow-based models, show benefits not primarily from distributional learning but from stochasticity injection and iterative computation, leading to lower off-manifold error.
The minimal iterative policy (MIP) demonstrates that the combination of stochasticity and iterative computation, without explicit distributional fitting, can match or exceed flow policy performance on many tasks.
Injecting noise into actions during data collection can theoretically guarantee imitation without compounding error by exciting controllable modes in the system, a principle that flow policies seem to emulate.
The persistent difficulty of physical robot tasks
The seminar opens by revisiting Moravec's Paradox, an observation that performing tasks requiring high-level reasoning, like solving math problems or playing chess, has historically been easier for robots than tasks involving basic human motor skills and perception, such as walking, grasping, or manipulating objects in the physical world. Despite significant progress, especially in the last few years, in enabling robots to perform narrowly scoped manipulation tasks through techniques like behavior cloning, a core challenge remains: the fundamental difficulty in learning from demonstration in continuous control settings. The speaker posits that this isn't just a data problem but an 'algorithmic Moravec's Paradox,' where specific algorithmic interventions are crucial for learning effectively from collected data, especially in the physical domain.
Compounding error: The Achilles' heel of behavior cloning
In traditional behavior cloning (BC), algorithms minimize the supervised-learning error between expert demonstrations and the learned policy. The critical issue, however, is not generalization on the training data but the mismatch between minimizing fitting error and maximizing cumulative reward. BC minimizes error at each state observed in the training data, yet during rollout the policy generates its own states, so small errors can accumulate exponentially over the horizon. Using the MDP formalism, the talk frames robot learning as a control problem where stability is key: even when the expert policy and the system dynamics are stable in both the open-loop and closed-loop senses, a learned policy can itself induce instability. A negative result is presented: even in the 'nicest' possible settings (smooth, deterministic, Markovian, stable expert and dynamics, and restricted learner policies), any algorithm using only demonstration data faces exponential compounding error. This limitation applies even to offline reinforcement learning algorithms, and it persists even when the reward function is known.
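The compounding-error argument can be made concrete with the standard tracking-error recursion e_{t+1} ≤ L·e_t + ε, where ε is the per-step fitting error and L is the Lipschitz constant of the closed loop under the learned policy. The following is our own minimal numerical sketch, not code from the talk:

```python
def compounded_error(eps, lipschitz, horizon):
    """Worst-case tracking-error recursion e_{t+1} = L * e_t + eps, e_0 = 0.

    eps       -- per-step imitation (fitting) error
    lipschitz -- Lipschitz constant L of the closed loop under the learned policy
    """
    e, history = 0.0, []
    for _ in range(horizon):
        e = lipschitz * e + eps
        history.append(e)
    return history

# Expansive closed loop (L > 1): error compounds exponentially in the horizon.
unstable = compounded_error(1e-3, 1.1, 100)
# Contractive closed loop (L < 1): error stays bounded by eps / (1 - L).
stable = compounded_error(1e-3, 0.9, 100)
```

With L = 1.1 the tracking error grows by roughly five orders of magnitude over 100 steps, while with L = 0.9 it saturates near ε/(1 − L) = 0.01, which is the horizon-independent behavior the interventions below try to recover.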
Action chunking: Stabilizing robot execution
Action chunking emerges as a foundational innovation to combat compounding error. Instead of predicting a single action at high frequency, action chunking involves predicting a sequence of actions (a 'chunk') and executing them open-loop. This practice is counterintuitive, as it implicitly relaxes the Markovian assumption by committing to actions based on potentially stale observations. Mathematically, large enough action chunks can prevent compounding error, with the error bounded by horizon-independent constants that depend on system properties like stability and smoothness. The mechanism leverages open-loop stability: if a small error is made in the first chunk, the system remains close to the nominal trajectory, and if the chunk is long enough, the learned policy becomes closed-loop stable itself, so deviations from an ideal reset state rapidly contract. Experiments show that executing longer action chunks, not merely supervising on them, is what drives performance, reducing jitter and producing smoother trajectories.
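The execution pattern described above (observe once, commit to a whole chunk, then re-observe) can be sketched with a toy rollout loop. The interface and the constant-chunk test policy are our own illustration, not the speaker's code:

```python
import numpy as np

def rollout_chunked(env_step, chunk_policy, x0, horizon):
    """Roll out a chunked policy: observe once, then execute the whole
    predicted action chunk open-loop before re-observing.

    chunk_policy(x) -> 1-D array of actions (the chunk), all computed from a
    single, possibly stale, observation x.
    """
    x, t, traj = x0, 0, [x0]
    while t < horizon:
        chunk = chunk_policy(x)            # commit to k actions from one observation
        for u in chunk[: horizon - t]:     # execute open-loop, no re-planning
            x = env_step(x, u)
            traj.append(x)
        t += min(len(chunk), horizon - t)
    return traj

# Toy check: 1-D integrator dynamics x' = x + u with a constant-chunk policy.
step = lambda x, u: x + u
policy = lambda x: np.full(8, 0.25)        # chunk of 8 identical actions
traj = rollout_chunked(step, policy, 0.0, 16)
```

Setting the chunk length to 1 recovers ordinary closed-loop behavior cloning, so the chunk length is the knob that trades feedback freshness against compounding-error suppression.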
Generative control policies: Beyond multimodal matching
The second key intervention is the use of generative control policies (GCPs), particularly flow-based models. Although GCPs were initially motivated by the need to capture multimodal action distributions (e.g., choosing to go over or under an obstacle), the research suggests their benefits lie elsewhere. A taxonomy identifies three aspects: distributional learning, stochasticity injection (adding noise during training via a latent variable 'Z'), and iterative computation (multiple steps in generating actions). Ablation studies indicate that while distributional learning is possible, it is not the primary driver of performance gains. Tasks where GCPs excel often require high precision rather than obvious multimodality, and experiments show minimal performance loss when multimodality is actively suppressed, suggesting that distributional matching is not the core benefit. Instead, the combination of stochasticity injection and iterative computation, embodied by the 'Minimal Iterative Policy' (MIP), often matches or exceeds GCP performance. This suggests GCPs may be addressing control-theoretic challenges rather than purely fitting complex distributions.
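The stochasticity-plus-iteration combination can be sketched as an inference loop. This is our own schematic under an assumed interface (the actual MIP training and inference details are in the speaker's work), with `refine_net` standing in for a network trained by plain regression:

```python
import numpy as np

def mip_infer(refine_net, obs, action_dim, n_iters=4, noise_scale=0.1, seed=0):
    """Schematic 'minimal iterative policy' inference (assumed interface):
    start from random noise and iteratively refine the action, injecting
    fresh noise at each step. This keeps stochasticity injection and
    iterative computation while dropping explicit distribution fitting.

    refine_net(obs, a) -> refined action (trained by ordinary regression).
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, 1.0, size=action_dim)          # stochastic initialization
    for _ in range(n_iters):
        noisy = a + rng.normal(0.0, noise_scale, size=action_dim)
        a = refine_net(obs, noisy)                     # iterative refinement step
    return a
```

With the noise scale set to zero and a contractive `refine_net`, this degenerates to plain fixed-point iteration; the injected noise is what distinguishes it from ordinary regression at inference time.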
The inductive bias of generative models
While not fully understood theoretically, generative models like flows appear to approximate inductive biases beneficial for robotics. One hypothesis is that they emulate an advanced data collection strategy where noise is injected into actions, allowing the system to learn from a distribution of states and excitations of controllable modes. This process encourages learning on subspaces that, if ignored, could lead to instability. In practice, GCPs seem to exhibit lower 'off-manifold' error – errors perpendicular to desired directions of movement – which is crucial for tasks like precise insertion. This is contrasted with simple regression, which might learn the manifold itself but not necessarily in a way that promotes robustness to deviations. The iterative refinement in GCPs, combined with stochasticity, may provide a form of regularization that implicitly guides the learning process towards more stable and precise actions, even without explicit distributional diversity.
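The noise-injected data-collection strategy that GCPs appear to emulate can be sketched as follows. This is our own toy construction in the spirit of that idea: noise perturbs the executed action so the dataset covers slightly off-manifold states, while each state is still labeled with the clean expert action.

```python
import numpy as np

def collect_noisy_demos(env_step, expert, x0, horizon, sigma=0.05, seed=0):
    """Collect demonstrations while injecting Gaussian noise into the
    *executed* action but labeling each visited state with the *clean*
    expert action. The injected noise excites controllable modes, so the
    dataset also covers off-manifold states the learner must recover from.
    """
    rng = np.random.default_rng(seed)
    data, x = [], x0
    for _ in range(horizon):
        u = expert(x)
        data.append((x, u))                          # supervise with the noiseless action
        x = env_step(x, u + rng.normal(0.0, sigma))  # execute the perturbed action
    return data

# Toy check: 1-D integrator with a stabilizing linear expert.
demos = collect_noisy_demos(lambda x, u: x + u, lambda x: -0.5 * x, 1.0, 20)
```

A regressor fit on such data sees states perpendicular to the demonstration manifold paired with corrective expert actions, which is one way to read the lower off-manifold error observed for GCPs.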
Practical implications and future directions
The insights from action chunking and generative policies explain the inflection point in robotics capabilities, making them scalable. While action chunking is vital when systems are open-loop stable, GCPs offer benefits even with inherent instabilities or high precision demands. The Minimal Iterative Policy (MIP) offers a simpler alternative for specific applications like high-precision tasks and fast inference, potentially serving as a shortcut policy. However, for pre-training at scale and robust reinforcement learning fine-tuning, full flow models with their iterative computation depth and noise resilience are still preferred. Beyond robotics, the 'test-time feedback phenomena' observed here could have implications for autoregressive language modeling and generative modeling, suggesting that optimization techniques designed to handle iterative error accumulation might be broadly applicable, especially in low-data regimes. The long-term dream of human-like rapid adaptation and novel situation learning in robots remains a significant challenge, possibly requiring new paradigms beyond current data-scaling approaches.
Common Questions
What is Moravec's Paradox?
Moravec's Paradox observes that tasks humans find easy, like physical manipulation and perception, have historically been harder for robots than complex reasoning tasks. This is because the data and algorithms for physical interaction are more challenging to acquire and develop.
Mentioned in this video
π* (pi star): The expert policy in imitation learning, whose demonstrations are used to train a new policy (π̂, pi hat).
π̂ (pi hat): The learned policy, which aims to achieve cumulative reward close to that of the expert policy (π*).
Minimal Iterative Policy (MIP): A policy designed to capture stochasticity and iterative computation without explicit distribution fitting; it performs well on high-precision tasks and offers fast inference.