Recursion Is The Next Scaling Law In AI
Tiny recursive AI models achieve state-of-the-art on complex reasoning tasks with a fraction of the parameters of giant LLMs, challenging the 'bigger is better' paradigm.
Key Insights
A 7-million parameter model (TRM) outperformed much larger models on ARC Prize and other reasoning tasks, demonstrating the power of recursion at inference time.
Standard LLMs are theoretically limited on certain reasoning tasks (like sorting) due to their one-shot, feed-forward nature, which lacks sufficient 'steps' for complex algorithms.
Hierarchical Reasoning Models (HRMs) leverage multiple levels of recursion and a novel 'truncated backpropagation through time' (stopping gradient early) training trick to achieve state-of-the-art results on ARC Prize with 27 million parameters.
TRMs simplify HRMs by using a single network for both low-level and high-level features (weight sharing) and still achieve superior performance with only 7 million parameters.
Both HRM and TRM training processes exhibit an 'expectation-maximization' (EM) like feel, updating latent states (ZL, ZH) iteratively without explicit chain-of-thought prompting.
The research suggests that recursion, not just model size, is a critical scaling law for AI reasoning capabilities, potentially enabling highly efficient, general-purpose models.
Recursive models outperform massive LLMs on reasoning tasks
The prevailing approach in AI has been to scale up Large Language Models (LLMs) by increasing parameter count and training data. However, recent research, particularly on Hierarchical Reasoning Models (HRMs) and Tiny Recursive Models (TRMs), highlights that recursion at inference time can unlock state-of-the-art reasoning capabilities with drastically fewer parameters. For instance, a 7-million parameter TRM model achieved superior performance on tasks like ARC Prize compared to models orders of magnitude larger, fundamentally questioning the 'bigger is always better' scaling law and suggesting recursion is the next crucial dimension for AI advancement.
Limitations of standard LLMs on complex reasoning
Standard Transformer-based LLMs, while powerful for next-token prediction and pattern matching, face theoretical limitations on tasks requiring sequential algorithmic steps, such as sorting or solving complex puzzles like Sudoku. Unlike recurrent neural networks (RNNs), which process information iteratively, LLMs perform a one-shot, parallel feed-forward pass. For a list of 31 elements, a transformer with only 30 layers may simply not have enough sequential 'steps' or 'comparisons' to execute a standard sorting algorithm (comparison sorting has a theoretical lower bound of n log n comparisons). The sequential computation such algorithms require is not a native part of the LLM architecture, and the gap widens with longer sequences or harder problems, where the model runs out of capacity to perform all the necessary operations within its fixed layer depth.
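To make that counting argument concrete, here is a back-of-the-envelope sketch (an illustration, not a proof; the 30-layer budget and the simplification of treating one layer as one sequential step are assumptions, since each layer also does parallel work across positions). It compares the comparison-sort lower bound with a fixed layer budget:

```python
import math

NUM_LAYERS = 30  # hypothetical depth, matching the 30-layer example above

def comparison_lower_bound(n: int) -> int:
    """Information-theoretic lower bound on comparisons needed to sort n
    distinct elements: ceil(log2(n!)) ~ n * log2(n)."""
    return math.ceil(math.log2(math.factorial(n)))

for n in (8, 31, 128):
    print(f"n={n:4d}: needs >= {comparison_lower_bound(n):5d} comparisons, "
          f"but only {NUM_LAYERS} sequential layers are available")
```

Already at n = 31 the bound is over a hundred comparisons, far more than 30 sequential steps, which is the intuition behind the limitation described above.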
Hierarchical Reasoning Models introduce recursion and novel training
Hierarchical Reasoning Models (HRMs), inspired partly by biological brains operating at different frequencies, introduce explicit recursion. An HRM processes input through a low-level module (L-net) and a high-level module (H-net), with each module performing a number of recursive steps (TL and TH, respectively). The key innovation lies in the training strategy, which deviates from backpropagating through every recursion step. Instead, HRMs employ truncated backpropagation through time (truncated BPTT): gradients flow back only a fixed number of steps (e.g., to the beginning of the H-net recursion). This 'stop gradient' technique, combined with a Deep Equilibrium (DEQ)-style fixed-point iteration that treats different points along the hidden-state trajectory as mini-batches, lets the model learn effectively without the vanishing or exploding gradients inherent in training deeply unrolled RNNs. Applied to a 27-million parameter model trained on the ARC Prize dataset from scratch, this approach achieved state-of-the-art results.
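As a rough illustration of the stop-gradient pattern described above (a minimal sketch, not the authors' implementation; TinyBlock, the hidden size, and the TL/TH step counts are placeholders), the snippet below runs all but the final L-net and H-net updates without gradients, so only the last update of each module is backpropagated:

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Stand-in for the L-net / H-net recurrent blocks (sizes are illustrative)."""
    def __init__(self, n_inputs: int, dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(n_inputs * dim, dim)

    def forward(self, *states):
        return torch.tanh(self.proj(torch.cat(states, dim=-1)))

def hrm_segment(x, zL, zH, L_net, H_net, TL=6, TH=2):
    """One forward segment with the stop-gradient trick: all but the final
    updates run under no_grad, so backpropagation is truncated to the last
    step of each recursion (a DEQ-style one-step gradient)."""
    with torch.no_grad():
        for _ in range(TH):
            for _ in range(TL):
                zL = L_net(x, zL, zH)   # fast, low-level refinement
            zH = H_net(zL, zH)          # slow, high-level update
    # Final updates with gradients enabled; only these are backpropagated.
    zL = L_net(x, zL, zH)
    zH = H_net(zL, zH)
    return zL, zH

dim = 64
L_net, H_net = TinyBlock(3, dim), TinyBlock(2, dim)
x, zL, zH = torch.randn(1, dim), torch.zeros(1, dim), torch.zeros(1, dim)
zL, zH = hrm_segment(x, zL, zH, L_net, H_net)
```

The practical effect is that memory and compute for the backward pass stay roughly constant no matter how many recursion steps are unrolled, since only the final updates carry gradients.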
Tiny Recursive Models simplify and enhance HRM's success
Tiny Recursive Models (TRMs) build on the HRM framework with a focus on simplification and efficiency. The most significant change is collapsing the distinct L-net and H-net into a single weight-shared network (dubbed 'net') that extracts both low-level and high-level features. This simplification performed surprisingly well; on some tasks, such as Sudoku, even replacing attention with MLP layers gave comparable or better results. TRMs also refine the recursion and training: rather than truncating backpropagation to a single recursion step as in some HRM implementations, they backpropagate through one full latent recursion after the main computation. Crucially, TRMs achieve this strong performance with substantially fewer parameters (down to 7 million) and demonstrate that adding more recursion, rather than more parameters, is sufficient for performance gains, echoing findings from researchers like Melanie Mitchell.
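A minimal sketch of the weight-sharing idea follows (hedged and illustrative: the class name, dimensions, step count, and the additive conditioning are assumptions, not the paper's exact design). One network both refines a latent scratchpad z and updates the proposed answer y:

```python
import torch
import torch.nn as nn

class TRMSketch(nn.Module):
    """Minimal sketch of a tiny recursive model: a single shared network
    refines a latent scratchpad z several times, then updates the answer y."""
    def __init__(self, dim: int = 64, n_latent_steps: int = 6):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.n_latent_steps = n_latent_steps

    def forward(self, x, y, z):
        for _ in range(self.n_latent_steps):
            z = self.net(x + y + z)   # refine latent state, conditioned on input and current answer
        y = self.net(y + z)           # the same weights update the answer from the refined latent
        return y, z

# Illustrative usage: refine an all-zero initial answer for one pass.
model = TRMSketch()
x, y, z = torch.randn(1, 64), torch.zeros(1, 64), torch.zeros(1, 64)
y, z = model(x, y, z)
```

The design choice to reuse one set of weights for both roles is what keeps the parameter count tiny while the recursion supplies the effective depth.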
Recursive training resembles expectation-maximization
The optimization process for both HRMs and TRMs has an iterative, expectation-maximization (EM)-like quality. The models repeatedly update the low-level latent variable (ZL) conditioned on the input (X) and the current high-level latent variable (ZH), then update ZH conditioned on the refined ZL. This process does not rely on explicit step-by-step reasoning like chain-of-thought (CoT) prompting. Instead, the model learns to use its latent space (often conceptualized as 'memory' or a 'carry') to store intermediate computational state and gradually refine a proposed answer. It is analogous to solving a Sudoku puzzle: making small deductions from the current board to fill in a few more cells, rather than trying to guess every cell at once. Training maximizes the probability of a correct outcome by learning an efficient strategy for updating and using these latent states.
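Sketching that outer, EM-like loop (again hedged and illustrative: it reuses the TRMSketch class from the previous sketch, and the answer head, labels, and number of improvement passes are hypothetical), each pass refines the latents, commits an improved answer, and is supervised directly, loosely mirroring E and M steps:

```python
import torch
import torch.nn as nn

# Reuses TRMSketch from the sketch above; data, head, and step count are hypothetical.
dim, n_improvement_steps = 64, 4
model = TRMSketch(dim)
answer_head = nn.Linear(dim, 10)                      # decode the answer embedding into 10 classes
params = list(model.parameters()) + list(answer_head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(8, dim)                               # embedded puzzle inputs (batch of 8)
target = torch.randint(0, 10, (8,))                   # placeholder labels
y, z = torch.zeros(8, dim), torch.zeros(8, dim)       # initial answer and latent "carry"

for step in range(n_improvement_steps):
    # E-like step: refine latent deductions; M-like step: commit an improved answer.
    y, z = model(x, y.detach(), z.detach())           # carry states forward, gradients stay local to each pass
    loss = nn.functional.cross_entropy(answer_head(y), target)
    loss.backward()                                    # supervise every improvement pass
    optimizer.step()
    optimizer.zero_grad()
```

Each pass starts from the previous pass's answer and scratchpad, much like filling in a few more Sudoku cells per deduction rather than guessing the whole grid at once.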
Recursion offers efficient reasoning beyond parameter count
The success of TRMs and HRMs suggests a path towards more efficient AI reasoning. Instead of needing trillions of parameters and massive datasets to achieve reasoning prowess, these recursive models demonstrate 'compute depth' without parameter depth. They can effectively perform complex algorithmic computations and reasoning by iterating over a smaller set of weights and states. This is a departure from LLMs that primarily focus on finding rich embedding representations. The implication is that future AI could combine the general world knowledge and embedding capabilities of massive LLMs with the efficient, specialized reasoning power of tiny recursive models, leading to hybrid architectures that are both powerful and computationally tractable.
Beyond task-specific models: Generalization through recursion
A key characteristic of the HRM and TRM models discussed here is their task-specificity: a model trained on Sudoku will not perform well on ARC Prize without retraining. This contrasts with general-purpose LLMs, which adapt to many tasks through fine-tuning or few-shot learning. The underlying principle of recursion, however, points toward general-purpose agents capable of complex reasoning. The vision is to use large models to generate powerful latent representations and then employ smaller, recursive reasoning modules within that latent space to perform the computation. This hybrid approach could give AI systems the broad understanding of LLMs together with the specialized reasoning efficiency demonstrated by recursive models like TRMs, bridging the gap between current LLM limitations and true artificial general intelligence.
Common Questions
What does recursion mean in AI models?
Recursion in AI models refers to a process where a model calls itself repeatedly to perform a task. This approach aims to improve reasoning performance by iterating through computations rather than solely relying on increasing model size.
Mentioned in this video
Deep Equilibrium Models (DEQ): A method for training models where, instead of backpropagating through all layers, gradients are calculated at a fixed point.
Turing machine: A theoretical model of computation referenced to explain the limitations of LLMs in performing complex algorithms without external memory.