Stanford CS25: Transformers United V6 | From Next-Token Prediction to Next-Generation Intelligence
Key Moments
Front-loading reasoning during LLM pre-training yields durable gains, outperforming models that only learn reasoning later, even with more fine-tuning.
Key Insights
Volta, which follows a two-phase pre-training approach with an optimal data mixture, is on average 17% better than Pascal, who learns randomly.
Models that include reasoning data during pre-training (Ampere) show a 16% improvement right after pre-training compared to those that don't (Volta), and a 9.3% advantage persists after Supervised Fine-Tuning (SFT).
High-quality reasoning data (SHQ), when combined with lower-quality data (LDQ) to form LMQ, shows no immediate benefit after pre-training but leads to a 4.25% boost after SFT.
Front-loading reasoning provides a 3% gain even when the 'no-reason base' model receives twice the SFT compute, and a 12% average gain when a fixed reasoning-data budget is split optimally between pre-training and SFT.
RLP, which trains models to 'think' before predicting the next token using an information-gain reward, outperforms a standard next-token-prediction baseline by 19% on the Qwen 1.7B model, even with token-matched training.
On the Nemotron Nano 12B model, RLP training on 250 million tokens achieves a 35% gain over a baseline trained on 20 trillion tokens.
The four pillars of building state-of-the-art LLMs
The speaker outlines four critical components for building advanced Large Language Models (LLMs): smart data, employing high-quality, diverse datasets with effective filtering and deduplication; smart architecture, considering evolving designs like Mamba 2 and hybrid models beyond transformers; smart algorithms, focusing on optimized training techniques; and smart collaboration between research, engineering, and post-training teams. The work discussed centers on developing smart algorithms and optimizing data usage within these frameworks.
Maximizing data potential with a two-phase pre-training approach
To leverage vast datasets effectively, a two-phase pre-training strategy is proposed. Phase one emphasizes data diversity, exposing the model to a wide range of data, including medium- and low-quality web crawls alongside a smaller proportion of high-quality data. Phase two shifts focus exclusively to high-quality sources such as math, code, and Wikipedia, repeating them for a larger number of epochs. Volta, which follows this approach, outperforms Pascal, who learns randomly, by 17% on average. Volta also gains 3.4% over a model trained on the optimal data mixture but in random order, highlighting the importance of structured data sequencing.
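To make the schedule concrete, here is a minimal sketch of a two-phase data sampler. The source names, mixture weights, and the 80/20 phase split are illustrative assumptions, not the exact recipe from the talk.

```python
# Hypothetical two-phase pre-training data schedule; all weights are invented.
import random

PHASE1_WEIGHTS = {  # diversity-focused: mostly web crawl, a little high-quality data
    "web_crawl_medium": 0.55,
    "web_crawl_low": 0.25,
    "math": 0.07,
    "code": 0.08,
    "wikipedia": 0.05,
}
PHASE2_WEIGHTS = {  # quality-focused: only high-quality sources, repeated for more epochs
    "math": 0.40,
    "code": 0.40,
    "wikipedia": 0.20,
}

def sample_source(step: int, total_steps: int, phase1_frac: float = 0.8) -> str:
    """Pick a data source for this step: the diverse phase-1 blend for the first
    phase1_frac of training, then the high-quality phase-2 blend."""
    weights = PHASE1_WEIGHTS if step < phase1_frac * total_steps else PHASE2_WEIGHTS
    sources, probs = zip(*weights.items())
    return random.choices(sources, weights=probs, k=1)[0]

# The blend changes once training crosses the phase boundary.
print(sample_source(step=100, total_steps=1000))  # drawn from the phase-1 mix
print(sample_source(step=900, total_steps=1000))  # drawn from the phase-2 mix
```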
Front-loading reasoning for a stronger foundation
The conventional LLM pipeline often treats reasoning as a post-hoc addition during Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL), leaving a weak foundation for reasoning. The proposed 'front-loading reasoning' approach integrates reasoning skill acquisition directly into the pre-training phase. This strategy, exemplified by Ampere, shows immediate benefits, outperforming Volta by 16% right after pre-training. Critically, these gains are not 'washed away' by subsequent SFT: Ampere maintains a 9.3% advantage. This suggests that early exposure to reasoning data builds a more robust, persistent capability and makes the model more receptive to further refinement.
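Schematically (the weights below are invented, not from the lecture), the only difference between the two setups is where reasoning data enters the pipeline:

```python
# Hypothetical pre-training mixtures; the 0.10 reasoning share is an assumption.
volta_mix = {"web_crawl": 0.80, "math": 0.10, "code": 0.10}   # no-reason base
ampere_mix = {"web_crawl": 0.70, "math": 0.10, "code": 0.10,
              "reasoning_traces": 0.10}                        # reasoning front-loaded

# Both models later receive the same SFT reasoning data, so the 16% post-pre-training
# and 9.3% post-SFT gaps isolate the effect of early reasoning exposure.
```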
The impact of reasoning data quality and quantity
An analysis of reasoning data categorizes it by quality, quantity, and diversity. Datasets like SHQ are small in quantity/diversity but high in quality, while LDQ is large in quantity/diversity but low in quality. LMQ is a combination of both. While combining SHQ and LDQ (LMQ) showed no immediate gain over LDQ after pre-training, it yielded a 4.25% boost after SFT. This indicates that high-quality data, even if less abundant initially, can unlock latent potential in downstream tasks, and its benefits may emerge later in the training pipeline without causing overfitting. This underscores the nuanced interplay between data characteristics and their effectiveness at different training stages.
Durable advantages from early reasoning exposure
Front-loading reasoning creates a lasting advantage that later-stage improvements struggle to close. Even when a 'no-reason base' model receives twice the SFT compute (2x epochs), the 'reason base' model (Ampere) still achieves a 3% gain with just one epoch of SFT. Furthermore, when the total 'reasoning data budget' is fixed, splitting it between pre-training and post-training (reason base) yields a 12% average improvement over spending all reasoning data on SFT (no-reason base). Together, these results show that omitting reasoning data from pre-training cannot be compensated for by extra SFT compute or data allocation, underscoring the foundational importance of early reasoning integration.
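A toy calculation makes the fixed-budget experiment concrete; the total budget and the split ratio below are hypothetical, chosen only to illustrate the allocation, not the lecture's actual numbers.

```python
# Fixed reasoning-data budget, allocated two ways (all numbers are hypothetical).
TOTAL_REASONING_TOKENS = 10_000_000_000

# no-reason base: every reasoning token is saved for SFT
no_reason_base = {"pretrain": 0, "sft": TOTAL_REASONING_TOKENS}

# reason base: the same budget split between pre-training and SFT
split = 0.7  # illustrative fraction front-loaded into pre-training
reason_base = {"pretrain": int(split * TOTAL_REASONING_TOKENS),
               "sft": int((1 - split) * TOTAL_REASONING_TOKENS)}

# Same total budget either way; only the allocation differs.
assert sum(no_reason_base.values()) == sum(reason_base.values())
```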
Reinforcement Learning as a Pre-training Objective (RLP)
The Reinforcement Learning as a Pre-training Objective (RLP) framework aims to teach models to 'learn by doing' rather than merely 'learning by observing' text. Unlike standard next-token prediction, RLP has the model generate an explicit reasoning trace (a 'thought') before predicting the next token. The thought is sampled from a 'thought policy,' and its quality is scored with an information-gain reward, r_t = log p_θ(x_t | x_<t, c_t) − log p_φ(x_t | x_<t), where p_θ is the next-token probability conditioned on the thought c_t and p_φ is the probability under a delayed 'no-think' baseline. This reward is dense and applies at every position, unlike the sparse, binary rewards used in other methods, and the delayed baseline also mitigates reward hacking. RLP shows significant improvements, outperforming standard next-token prediction by 19% on the Qwen 1.7B model and demonstrating remarkable data efficiency: 250 million RLP tokens boost the Nemotron Nano 12B model by 35% over a 20-trillion-token baseline.
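A minimal sketch of the information-gain reward, assuming PyTorch tensors of logits from the thought-conditioned policy and the delayed no-think baseline; the tensor shapes and names are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def rlp_reward(theta_logits: torch.Tensor,  # [batch, seq, vocab], conditioned on the sampled thought
               phi_logits: torch.Tensor,    # [batch, seq, vocab], delayed "no-think" baseline
               target_ids: torch.Tensor     # [batch, seq], the observed next tokens
               ) -> torch.Tensor:
    """Dense per-position reward: r_t = log p_theta(x_t | x_<t, c_t) - log p_phi(x_t | x_<t)."""
    logp_theta = F.log_softmax(theta_logits, dim=-1)
    logp_phi = F.log_softmax(phi_logits, dim=-1)
    # Log-probability assigned to the actual next token at every position.
    lp_with_thought = logp_theta.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    lp_without = logp_phi.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    # Positive wherever the thought made the true next token more likely.
    return lp_with_thought - lp_without
```

Because the reward is just a difference of log-likelihoods on observed text, it needs no external verifier and is defined at every token position.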
RLP's sustained performance and scalability
RLP's benefits are not limited to early pre-training checkpoints and are robust across model sizes and architectures. Starting from an intermediate checkpoint of the Nemotron Nano 12B model (roughly 19.8 trillion tokens seen), just 250 million additional RLP tokens yielded a 35% average gain over the base model trained on the full 20 trillion tokens, indicating substantial token efficiency, especially in domains like science. Even after identical post-training (SFT and RL), RLP maintains a 3% absolute margin over the baseline. The results suggest that RLP's advantages persist, and potentially amplify, with larger models and more complex architectures, opening new avenues for scaling reasoning capabilities.
RLP's edge over prior reinforcement pre-training methods
RLP distinguishes itself from prior methods like Reinforcement Pretraining (RPT) and Reinforcement Learning on Pretraining Data (RLPT) through its verifier-free, intrinsic reward mechanism. While RPT and RLPT rely on external verifiers and sparse, binary rewards, RLP uses a dense, information gain-based reward. This allows RLP to reinforce reasoning steps at every position without needing a separate model for reward calculation. In a direct comparison using 170 million tokens, RLP outperformed RPT by 4%, attributed to RLP's ability to capture the full reasoning signal via its dense reward, whereas RPT's sparse rewards and selective application ignore intermediate reasoning steps. RLP also demonstrates gains (7-9%) even when applied to unannotated text streams like web crawl data, showing its versatility.
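The sparse-versus-dense distinction can be shown schematically; both functions below are simplified illustrations, not the actual RPT or RLP implementations.

```python
def rpt_style_sparse_reward(predicted_token: int, gold_token: int) -> float:
    # Verifier-style binary signal: credit only for an exact match at the
    # scored position; intermediate reasoning steps receive no reward.
    return 1.0 if predicted_token == gold_token else 0.0

def rlp_style_dense_rewards(logp_with_thought: list[float],
                            logp_without: list[float]) -> list[float]:
    # Information gain at every position: any thought that raises the
    # probability of the observed next token earns positive reward.
    return [a - b for a, b in zip(logp_with_thought, logp_without)]
```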
Comparison of Learner Strategies (Pascal vs. Volta vs. Ampere vs. Hopper)
Data extracted from this episode
| Strategy Component | Pascal | Volta | Ampere | Hopper |
|---|---|---|---|---|
| Curriculum (Data Order) | No | Yes | Yes | Yes |
| Front-loading Reasoning | No | No | Yes | Yes |
| Learning Through Thinking | No | No | No | Yes |
| Relative Improvement over Pascal | Baseline | >17% | Unknown | ~60% |
Two-Phase Pre-training vs. Baselines
Data extracted from this episode
| Comparison | Average Improvement (%) |
|---|---|
| Two-phase pre-training (Volta) vs. optimal data mixture with random ordering | 3.4% |
| Two-phase pre-training (Volta) vs. Pascal (random learning) | 17.0% |
Front-loading Reasoning: Gains After Pre-training and SFT
Data extracted from this episode
| Model Comparison | Gain After Pre-training (%) | Gain After SFT (%) |
|---|---|---|
| Ampere (Reason Base) vs. Volta (No Reason Base) | 16% | 9.3% |
Impact of Reasoning Data Quality and Quantity (LMQ vs. LDQ)
Data extracted from this episode
| Dataset Comparison | Performance After Pre-training | Performance After SFT |
|---|---|---|
| LMQ (SHQ + LDQ) vs. LDQ (Low Quality) | Similar performance | 4.25% boost for LMQ |
RLP vs. Next Token Prediction Baselines (Qwen 1.7B)
Data extracted from this episode
| Comparison Type | RLP Outperformance (%) |
|---|---|
| Token-matched (1B tokens) | 19% (17% vs. a second baseline) |
| FLOP-matched (170M RLP tokens vs. 6B NTP tokens) | 14% |
RLP vs. Base Model (Nemotron Nano 12B v2)
Data extracted from this episode
| Comparison | Training Tokens | RLP Outperformance (%) |
|---|---|---|
| Base model vs. base + RLP | Base: 20T; base + RLP: ~19.8T + 250M RLP tokens | 35% |
RLP vs. RPT Quantitative Comparison on OmniMath
Data extracted from this episode
| Technique | Average Improvement (%) |
|---|---|
| RPT | Baseline |
| RLP | +4% over RPT |
Common Questions
What does it take to build a state-of-the-art LLM?
Building a SOTA LLM requires four key components: smart data (high-quality, diverse, and well-filtered), smart architecture (evolving designs like transformers and Mamba 2), smart algorithms (advanced training recipes), and smart collaboration between teams (pre-training, post-training, research, and engineering).
Topics
Mentioned in this video
A large language model trained around 2021 on hundreds of billions of tokens, serving as a reference point for data consumption evolution.
RLPT (Reinforcement Learning on Pretraining Data): A related paper and technique compared against RLP.
Qwen 1.7B: A base model used in RLP experiments, showing significant improvements after RLP training.
A dataset and classification system that goes beyond educational quality to include domain and level of education.
LDQ: A reasoning dataset characterized by large diversity and quantity but low quality.
LMQ: A combined dataset (concatenation of SHQ and LDQ) used to study the impact of high-quality data in pre-training.
Nemotron Nano 12B v2: A large hybrid Mamba 2-based model used for experiments showing RLP benefits at scale.
FineWeb-Edu: A popular approach for classifying web crawl data quality based on educational content.
Nemotron: A family of large language models developed at NVIDIA, on which the speaker was a lead contributor.
An open-source dataset mentioned in the context of categorizing data quality, noted for being a diverse but less filtered source compared to others.
A dataset used for Supervised Fine-Tuning, noted for its diversity across domains and less stringent filtering, affecting its quality rating.
RPT (Reinforcement Pretraining): A related paper and technique compared against RLP, which uses an external verifier and a sparse reward.
OmniMath: A dataset used for quantitative comparison between RPT and RLP.
A dataset from NVIDIA geared towards reasoning beyond math and code, which became a trending dataset.
SHQ: A reasoning dataset characterized by small quantity and diversity but high quality.
A large language model trained in late 2024 on tens of trillions of tokens, illustrating the rapid increase in data consumption for LLMs.
A dataset containing common math examples derived from Common Crawl documents, developed at NVIDIA.
A method developed at NVIDIA to encourage diversity in synthetic data generation.
GRPO: A popular framework for post-training LLMs, with similarities in advantage calculation to RLP's reward mechanism.
Mamba 2: An advanced architecture mentioned as an evolution beyond transformers in the field of LLMs.
A dataset available on Hugging Face, contributed to by the speaker during their time at NVIDIA.
Pascal: A hypothetical learner representing a baseline approach that follows no curriculum and does not utilize reasoning data effectively.
Hopper: A hypothetical learner representing the optimal approach, utilizing a curriculum, front-loading reasoning, and learning through active thinking.
Volta: A hypothetical learner representing an improvement over Pascal by following a curriculum and creating an optimal data mixture, but not utilizing reasoning data effectively.
Ampere: A hypothetical learner representing an improvement over Volta by following a curriculum and front-loading reasoning data, but not learning through active thinking.
Bolt: A hypothetical learner who learns by observing and analyzing, contrasted with Leo, who learns by doing.
RLAIF: Reinforcement Learning from AI Feedback, an alternative to RLHF that may reduce subjective bias.
Leo: A hypothetical learner who learns by doing, contrasting with Bolt, who learns by observing.
Company where the speaker recently joined, specializing in AI and large language models.
Hugging Face: Platform where NVIDIA's datasets, including the Nemotron datasets, are made available in an open-source manner.
A company focused on alignment research and improving LLM safety, mentioned in the context of reducing reward hacking.
NVIDIA: Company where the speaker previously worked, contributing to the Nemotron family of models and pre-training pipelines.