Recursion Is The Next Scaling Law In AI
Tiny recursive AI models achieve state-of-the-art on complex reasoning tasks with a fraction of the parameters of giant LLMs, challenging the 'bigger is better' paradigm.
Key Insights
A 7-million parameter model (TRM) outperformed much larger models on ARC Prize and other reasoning tasks, demonstrating the power of recursion at inference time.
Standard LLMs are theoretically limited on certain reasoning tasks (like sorting) due to their one-shot, feed-forward nature, which lacks sufficient 'steps' for complex algorithms.
Hierarchical Reasoning Models (HRMs) leverage multiple levels of recursion and a novel 'truncated backpropagation through time' (stopping gradient early) training trick to achieve state-of-the-art results on ARC Prize with 27 million parameters.
TRMs simplify HRMs by using a single network for both low-level and high-level features (weight sharing) and still achieve superior performance with only 7 million parameters.
Both HRM and TRM training processes exhibit an 'expectation-maximization' (EM) like feel, updating latent states (ZL, ZH) iteratively without explicit chain-of-thought prompting.
The research suggests that recursion, not just model size, is a critical scaling law for AI reasoning capabilities, potentially enabling highly efficient, general-purpose models.
Recursive models outperform massive LLMs on reasoning tasks
The prevailing approach in AI has been to scale up Large Language Models (LLMs) by increasing parameter count and training data. However, recent research, particularly on Hierarchical Reasoning Models (HRMs) and Tiny Recursive Models (TRMs), highlights that recursion at inference time can unlock state-of-the-art reasoning capabilities with drastically fewer parameters. For instance, a 7-million parameter TRM model achieved superior performance on tasks like ARC Prize compared to models orders of magnitude larger, fundamentally questioning the 'bigger is always better' scaling law and suggesting recursion is the next crucial dimension for AI advancement.
Limitations of standard LLMs on complex reasoning
Standard Transformer-based LLMs, while powerful for next-token prediction and pattern matching, face theoretical limitations on tasks requiring sequential algorithmic steps, such as sorting or solving complex puzzles like Sudoku. Unlike recurrent neural networks (RNNs), which process information iteratively, LLMs perform a one-shot, parallel feed-forward pass. For a list of 31 elements, a transformer with only 30 layers may simply not have enough sequential 'steps' or 'comparisons' to execute a standard sorting algorithm (comparison sorting has a theoretical lower bound of n log n comparisons). The sequential computation such algorithms require is not a native part of the LLM architecture, and the gap widens with longer sequences or harder problems, where the model runs out of capacity to perform all the necessary operations within its fixed layer depth.
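To make that counting argument concrete, here is a back-of-the-envelope sketch (an illustration, not a proof; the 30-layer budget and the simplification of treating one layer as one sequential step are assumptions, since each layer also does parallel work across positions). It compares the comparison-sort lower bound with a fixed layer budget:

```python
import math

NUM_LAYERS = 30  # hypothetical depth, matching the 30-layer example above

def comparison_lower_bound(n: int) -> int:
    """Information-theoretic lower bound on comparisons needed to sort n
    distinct elements: ceil(log2(n!)) ~ n * log2(n)."""
    return math.ceil(math.log2(math.factorial(n)))

for n in (8, 31, 128):
    print(f"n={n:4d}: needs >= {comparison_lower_bound(n):5d} comparisons, "
          f"but only {NUM_LAYERS} sequential layers are available")
```

Already at n = 31 the bound is over a hundred comparisons, far more than 30 sequential steps, which is the intuition behind the limitation described above.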
Hierarchical Reasoning Models introduce recursion and novel training
Hierarchical Reasoning Models (HRMs), inspired partly by biological brains operating at different frequencies, introduce explicit recursion. An HRM processes input through a low-level module (L-net) and a high-level module (H-net), with each module performing a number of recursive steps (TL and TH, respectively). The key innovation lies in the training strategy, which deviates from backpropagating through every recursion step. Instead, HRMs employ truncated backpropagation through time (truncated BPTT): gradients flow back only a fixed number of steps (e.g., to the beginning of the H-net recursion). This 'stop gradient' technique, combined with a Deep Equilibrium (DEQ)-style fixed-point iteration that treats different points along the hidden-state trajectory as mini-batches, lets the model learn effectively without the vanishing or exploding gradients inherent in training deeply unrolled RNNs. Applied to a 27-million parameter model trained on the ARC Prize dataset from scratch, this approach achieved state-of-the-art results.
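As a rough illustration of the stop-gradient pattern described above (a minimal sketch, not the authors' implementation; TinyBlock, the hidden size, and the TL/TH step counts are placeholders), the snippet below runs all but the final L-net and H-net updates without gradients, so only the last update of each module is backpropagated:

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Stand-in for the L-net / H-net recurrent blocks (sizes are illustrative)."""
    def __init__(self, n_inputs: int, dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(n_inputs * dim, dim)

    def forward(self, *states):
        return torch.tanh(self.proj(torch.cat(states, dim=-1)))

def hrm_segment(x, zL, zH, L_net, H_net, TL=6, TH=2):
    """One forward segment with the stop-gradient trick: all but the final
    updates run under no_grad, so backpropagation is truncated to the last
    step of each recursion (a DEQ-style one-step gradient)."""
    with torch.no_grad():
        for _ in range(TH):
            for _ in range(TL):
                zL = L_net(x, zL, zH)   # fast, low-level refinement
            zH = H_net(zL, zH)          # slow, high-level update
    # Final updates with gradients enabled; only these are backpropagated.
    zL = L_net(x, zL, zH)
    zH = H_net(zL, zH)
    return zL, zH

dim = 64
L_net, H_net = TinyBlock(3, dim), TinyBlock(2, dim)
x, zL, zH = torch.randn(1, dim), torch.zeros(1, dim), torch.zeros(1, dim)
zL, zH = hrm_segment(x, zL, zH, L_net, H_net)
```

The practical effect is that memory and compute for the backward pass stay roughly constant no matter how many recursion steps are unrolled, since only the final updates carry gradients.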
Tiny Recursive Models simplify and enhance HRM's success
Tiny Recursive Models (TRMs) build on the HRM framework with a focus on simplification and efficiency. The most significant change is collapsing the distinct L-net and H-net into a single weight-shared network (dubbed 'net') that extracts both low-level and high-level features. This simplification performed surprisingly well; on some tasks, such as Sudoku, even replacing attention with MLP layers gave comparable or better results. TRMs also refine the recursion and training: rather than truncating backpropagation to a single recursion step as in some HRM implementations, they backpropagate through one full latent recursion after the main computation. Crucially, TRMs achieve this strong performance with substantially fewer parameters (down to 7 million) and demonstrate that adding more recursion, rather than more parameters, is sufficient for performance gains, echoing findings from researchers like Melanie Mitchell.
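A minimal sketch of the weight-sharing idea follows (hedged and illustrative: the class name, dimensions, step count, and the additive conditioning are assumptions, not the paper's exact design). One network both refines a latent scratchpad z and updates the proposed answer y:

```python
import torch
import torch.nn as nn

class TRMSketch(nn.Module):
    """Minimal sketch of a tiny recursive model: a single shared network
    refines a latent scratchpad z several times, then updates the answer y."""
    def __init__(self, dim: int = 64, n_latent_steps: int = 6):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.n_latent_steps = n_latent_steps

    def forward(self, x, y, z):
        for _ in range(self.n_latent_steps):
            z = self.net(x + y + z)   # refine latent state, conditioned on input and current answer
        y = self.net(y + z)           # the same weights update the answer from the refined latent
        return y, z

# Illustrative usage: refine an all-zero initial answer for one pass.
model = TRMSketch()
x, y, z = torch.randn(1, 64), torch.zeros(1, 64), torch.zeros(1, 64)
y, z = model(x, y, z)
```

The design choice to reuse one set of weights for both roles is what keeps the parameter count tiny while the recursion supplies the effective depth.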
Recursive training resembles expectation-maximization
The optimization process for both HRMs and TRMs has an iterative, expectation-maximization (EM)-like quality. The models repeatedly update the low-level latent variable (ZL) conditioned on the input (X) and the current high-level latent variable (ZH), then update ZH conditioned on the refined ZL. This process does not rely on explicit step-by-step reasoning like chain-of-thought (CoT) prompting. Instead, the model learns to use its latent space (often conceptualized as 'memory' or a 'carry') to store intermediate computational state and gradually refine a proposed answer. It is analogous to solving a Sudoku puzzle: making small deductions from the current board to fill in a few more cells, rather than trying to guess every cell at once. Training maximizes the probability of a correct outcome by learning an efficient strategy for updating and using these latent states.
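Sketching that outer, EM-like loop (again hedged and illustrative: it reuses the TRMSketch class from the previous sketch, and the answer head, labels, and number of improvement passes are hypothetical), each pass refines the latents, commits an improved answer, and is supervised directly, loosely mirroring E and M steps:

```python
import torch
import torch.nn as nn

# Reuses TRMSketch from the sketch above; data, head, and step count are hypothetical.
dim, n_improvement_steps = 64, 4
model = TRMSketch(dim)
answer_head = nn.Linear(dim, 10)                      # decode the answer embedding into 10 classes
params = list(model.parameters()) + list(answer_head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(8, dim)                               # embedded puzzle inputs (batch of 8)
target = torch.randint(0, 10, (8,))                   # placeholder labels
y, z = torch.zeros(8, dim), torch.zeros(8, dim)       # initial answer and latent "carry"

for step in range(n_improvement_steps):
    # E-like step: refine latent deductions; M-like step: commit an improved answer.
    y, z = model(x, y.detach(), z.detach())           # carry states forward, gradients stay local to each pass
    loss = nn.functional.cross_entropy(answer_head(y), target)
    loss.backward()                                    # supervise every improvement pass
    optimizer.step()
    optimizer.zero_grad()
```

Each pass starts from the previous pass's answer and scratchpad, much like filling in a few more Sudoku cells per deduction rather than guessing the whole grid at once.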
Recursion offers efficient reasoning beyond parameter count
The success of TRMs and HRMs suggests a path towards more efficient AI reasoning. Instead of needing trillions of parameters and massive datasets to achieve reasoning prowess, these recursive models demonstrate 'compute depth' without parameter depth. They can effectively perform complex algorithmic computations and reasoning by iterating over a smaller set of weights and states. This is a departure from LLMs that primarily focus on finding rich embedding representations. The implication is that future AI could combine the general world knowledge and embedding capabilities of massive LLMs with the efficient, specialized reasoning power of tiny recursive models, leading to hybrid architectures that are both powerful and computationally tractable.
Beyond task-specific models: Generalization through recursion
A key characteristic of the HRM and TRM models discussed here is their task-specificity: a model trained on Sudoku will not perform well on ARC Prize without retraining. This contrasts with general-purpose LLMs, which adapt to many tasks through fine-tuning or few-shot learning. The underlying principle of recursion, however, points toward general-purpose agents capable of complex reasoning. The vision is to use large models to generate powerful latent representations and then employ smaller, recursive reasoning modules within that latent space to perform the computation. This hybrid approach could give AI systems the broad understanding of LLMs together with the specialized reasoning efficiency demonstrated by recursive models like TRMs, bridging the gap between current LLM limitations and true artificial general intelligence.
Common Questions
What does recursion mean in AI models?
Recursion in AI models refers to a process where a model calls itself repeatedly to perform a task. This approach aims to improve reasoning performance by iterating through computations rather than solely relying on increasing model size.
Mentioned in this video
Deep Equilibrium Models (DEQ): A method for training models where, instead of backpropagating through all layers, gradients are calculated at a fixed point.
Turing machine: A theoretical model of computation referenced to explain the limitations of LLMs in performing complex algorithms without external memory.