Stanford Robotics Seminar ENGR319 | Winter 2026 | Gen Control, Action Chunking, Moravec’s Paradox
Key Moments
Robots struggle with physical tasks due to compounding errors, but action chunking and generative policies offer solutions by improving stability and learning inductive biases, making robotics more scalable.
Key Insights
Historically, robots have found physical tasks like motion and manipulation harder than symbolic reasoning tasks, a phenomenon known as Moravec's Paradox.
Behavior cloning, while effective as supervised learning, suffers from compounding error in continuous control settings, leading to instability even when the expert policy and system dynamics are stable.
Action chunking, the practice of predicting multiple actions in sequence and executing them open-loop, significantly mitigates compounding error by ensuring closed-loop stability for sufficiently large chunks.
Generative control policies, particularly flow-based models, show benefits not primarily from distributional learning but from stochasticity injection and iterative computation, leading to lower off-manifold error.
The minimal iterative policy (MIP) demonstrates that the combination of stochasticity and iterative computation, without explicit distributional fitting, can match or exceed flow policy performance on many tasks.
Injecting noise into actions during data collection can theoretically guarantee imitation without compounding error by exciting controllable modes in the system, a principle that flow policies seem to emulate.
The persistent difficulty of physical robot tasks
The seminar opens by revisiting Moravec's Paradox, an observation that performing tasks requiring high-level reasoning, like solving math problems or playing chess, has historically been easier for robots than tasks involving basic human motor skills and perception, such as walking, grasping, or manipulating objects in the physical world. Despite significant progress, especially in the last few years, in enabling robots to perform narrowly scoped manipulation tasks through techniques like behavior cloning, a core challenge remains: the fundamental difficulty in learning from demonstration in continuous control settings. The speaker posits that this isn't just a data problem but an 'algorithmic Moravec's Paradox,' where specific algorithmic interventions are crucial for learning effectively from collected data, especially in the physical domain.
Compounding error: The Achilles' heel of behavior cloning
In traditional behavior cloning (BC), algorithms minimize the supervised-learning error between expert demonstrations and the learned policy. The critical issue, however, is not generalization on the training data but the mismatch between minimizing fitting error and maximizing cumulative reward. BC minimizes error at each state observed in the training data, yet during rollout the policy generates its own states, so small errors can accumulate exponentially over the horizon. Using the MDP formalism, the talk frames robot learning as a control problem where stability is key: even when the expert policy and the system dynamics are stable in both the open-loop and closed-loop senses, a learned policy can itself induce instability. A negative result is presented: even in the 'nicest' possible settings (smooth, deterministic, Markovian, stable expert and dynamics, and restricted learner policies), any algorithm using only demonstration data faces exponential compounding error. This limitation applies even to offline reinforcement learning algorithms, and it persists even when the reward function is known.
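The compounding-error argument can be made concrete with the standard tracking-error recursion e_{t+1} ≤ L·e_t + ε, where ε is the per-step fitting error and L is the Lipschitz constant of the closed loop under the learned policy. The following is our own minimal numerical sketch, not code from the talk:

```python
def compounded_error(eps, lipschitz, horizon):
    """Worst-case tracking-error recursion e_{t+1} = L * e_t + eps, e_0 = 0.

    eps       -- per-step imitation (fitting) error
    lipschitz -- Lipschitz constant L of the closed loop under the learned policy
    """
    e, history = 0.0, []
    for _ in range(horizon):
        e = lipschitz * e + eps
        history.append(e)
    return history

# Expansive closed loop (L > 1): error compounds exponentially in the horizon.
unstable = compounded_error(1e-3, 1.1, 100)
# Contractive closed loop (L < 1): error stays bounded by eps / (1 - L).
stable = compounded_error(1e-3, 0.9, 100)
```

With L = 1.1 the tracking error grows by roughly five orders of magnitude over 100 steps, while with L = 0.9 it saturates near ε/(1 − L) = 0.01, which is the horizon-independent behavior the interventions below try to recover.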
Action chunking: Stabilizing robot execution
Action chunking emerges as a foundational innovation to combat compounding error. Instead of predicting a single action at high frequency, action chunking involves predicting a sequence of actions (a 'chunk') and executing them open-loop. This practice is counterintuitive, as it implicitly relaxes the Markovian assumption by committing to actions based on potentially stale observations. Mathematically, large enough action chunks can prevent compounding error, with the error bounded by horizon-independent constants that depend on system properties like stability and smoothness. The mechanism leverages open-loop stability: if a small error is made in the first chunk, the system remains close to the nominal trajectory, and if the chunk is long enough, the learned policy becomes closed-loop stable itself, so deviations from an ideal reset state rapidly contract. Experiments show that executing longer action chunks, not merely supervising on them, is what drives performance, reducing jitter and producing smoother trajectories.
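The execution pattern described above (observe once, commit to a whole chunk, then re-observe) can be sketched with a toy rollout loop. The interface and the constant-chunk test policy are our own illustration, not the speaker's code:

```python
import numpy as np

def rollout_chunked(env_step, chunk_policy, x0, horizon):
    """Roll out a chunked policy: observe once, then execute the whole
    predicted action chunk open-loop before re-observing.

    chunk_policy(x) -> 1-D array of actions (the chunk), all computed from a
    single, possibly stale, observation x.
    """
    x, t, traj = x0, 0, [x0]
    while t < horizon:
        chunk = chunk_policy(x)            # commit to k actions from one observation
        for u in chunk[: horizon - t]:     # execute open-loop, no re-planning
            x = env_step(x, u)
            traj.append(x)
        t += min(len(chunk), horizon - t)
    return traj

# Toy check: 1-D integrator dynamics x' = x + u with a constant-chunk policy.
step = lambda x, u: x + u
policy = lambda x: np.full(8, 0.25)        # chunk of 8 identical actions
traj = rollout_chunked(step, policy, 0.0, 16)
```

Setting the chunk length to 1 recovers ordinary closed-loop behavior cloning, so the chunk length is the knob that trades feedback freshness against compounding-error suppression.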
Generative control policies: Beyond multimodal matching
The second key intervention is the use of generative control policies (GCPs), particularly flow-based models. Although GCPs were initially motivated by the need to capture multimodal action distributions (e.g., choosing to go over or under an obstacle), the research suggests their benefits lie elsewhere. A taxonomy identifies three aspects: distributional learning, stochasticity injection (adding noise during training via a latent variable 'Z'), and iterative computation (multiple steps in generating actions). Ablation studies indicate that while distributional learning is possible, it is not the primary driver of performance gains. Tasks where GCPs excel often require high precision rather than obvious multimodality, and experiments show minimal performance loss when multimodality is actively suppressed, suggesting that distributional matching is not the core benefit. Instead, the combination of stochasticity injection and iterative computation, embodied by the 'Minimal Iterative Policy' (MIP), often matches or exceeds GCP performance. This suggests GCPs may be addressing control-theoretic challenges rather than purely fitting complex distributions.
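The stochasticity-plus-iteration combination can be sketched as an inference loop. This is our own schematic under an assumed interface (the actual MIP training and inference details are in the speaker's work), with `refine_net` standing in for a network trained by plain regression:

```python
import numpy as np

def mip_infer(refine_net, obs, action_dim, n_iters=4, noise_scale=0.1, seed=0):
    """Schematic 'minimal iterative policy' inference (assumed interface):
    start from random noise and iteratively refine the action, injecting
    fresh noise at each step. This keeps stochasticity injection and
    iterative computation while dropping explicit distribution fitting.

    refine_net(obs, a) -> refined action (trained by ordinary regression).
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, 1.0, size=action_dim)          # stochastic initialization
    for _ in range(n_iters):
        noisy = a + rng.normal(0.0, noise_scale, size=action_dim)
        a = refine_net(obs, noisy)                     # iterative refinement step
    return a
```

With the noise scale set to zero and a contractive `refine_net`, this degenerates to plain fixed-point iteration; the injected noise is what distinguishes it from ordinary regression at inference time.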
The inductive bias of generative models
While not fully understood theoretically, generative models like flows appear to approximate inductive biases beneficial for robotics. One hypothesis is that they emulate an advanced data collection strategy where noise is injected into actions, allowing the system to learn from a distribution of states and excitations of controllable modes. This process encourages learning on subspaces that, if ignored, could lead to instability. In practice, GCPs seem to exhibit lower 'off-manifold' error – errors perpendicular to desired directions of movement – which is crucial for tasks like precise insertion. This is contrasted with simple regression, which might learn the manifold itself but not necessarily in a way that promotes robustness to deviations. The iterative refinement in GCPs, combined with stochasticity, may provide a form of regularization that implicitly guides the learning process towards more stable and precise actions, even without explicit distributional diversity.
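The noise-injected data-collection strategy that GCPs appear to emulate can be sketched as follows. This is our own toy construction in the spirit of that idea: noise perturbs the executed action so the dataset covers slightly off-manifold states, while each state is still labeled with the clean expert action.

```python
import numpy as np

def collect_noisy_demos(env_step, expert, x0, horizon, sigma=0.05, seed=0):
    """Collect demonstrations while injecting Gaussian noise into the
    *executed* action but labeling each visited state with the *clean*
    expert action. The injected noise excites controllable modes, so the
    dataset also covers off-manifold states the learner must recover from.
    """
    rng = np.random.default_rng(seed)
    data, x = [], x0
    for _ in range(horizon):
        u = expert(x)
        data.append((x, u))                          # supervise with the noiseless action
        x = env_step(x, u + rng.normal(0.0, sigma))  # execute the perturbed action
    return data

# Toy check: 1-D integrator with a stabilizing linear expert.
demos = collect_noisy_demos(lambda x, u: x + u, lambda x: -0.5 * x, 1.0, 20)
```

A regressor fit on such data sees states perpendicular to the demonstration manifold paired with corrective expert actions, which is one way to read the lower off-manifold error observed for GCPs.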
Practical implications and future directions
The insights from action chunking and generative policies explain the inflection point in robotics capabilities, making them scalable. While action chunking is vital when systems are open-loop stable, GCPs offer benefits even with inherent instabilities or high precision demands. The Minimal Iterative Policy (MIP) offers a simpler alternative for specific applications like high-precision tasks and fast inference, potentially serving as a shortcut policy. However, for pre-training at scale and robust reinforcement learning fine-tuning, full flow models with their iterative computation depth and noise resilience are still preferred. Beyond robotics, the 'test-time feedback phenomena' observed here could have implications for autoregressive language modeling and generative modeling, suggesting that optimization techniques designed to handle iterative error accumulation might be broadly applicable, especially in low-data regimes. The long-term dream of human-like rapid adaptation and novel situation learning in robots remains a significant challenge, possibly requiring new paradigms beyond current data-scaling approaches.
Common Questions
What is Moravec's Paradox?
Moravec's Paradox observes that tasks humans find easy, like physical manipulation and perception, have historically been harder for robots than complex reasoning tasks. This is because the data and algorithms for physical interaction are more challenging to acquire and develop.
Mentioned in this video
π* (pi star): The expert policy in imitation learning, whose demonstrations are used to train a new policy (π̂, pi hat).
π̂ (pi hat): The learned policy, which aims to achieve cumulative reward close to that of the expert policy (π*).
Minimal Iterative Policy (MIP): A policy designed to capture stochasticity and iterative computation without explicit distribution fitting; it performs well on high-precision tasks and offers fast inference.