Key Moments

Stanford CS25: Transformers United V6 | From Next-Token Prediction to Next-Generation Intelligence

Stanford Online
Education · 5 min read · 58 min video
May 11, 2026
TL;DR

Front-loading reasoning during LLM pre-training yields durable gains, outperforming models that only learn reasoning later, even with more fine-tuning.

Key Insights

1

Volta, which follows a two-phase pre-training approach with an optimal data mixture, is on average 17% better than Pascal, which trains on randomly ordered data.

2

Models that include reasoning data during pre-training (Ampere) show a 16% improvement right after pre-training compared to those that don't (Volta), and a 9.3% advantage persists even after Supervised Fine-Tuning (SFT).

3

High-quality reasoning data (SHQ), when combined with lower-quality data (LDQ) to form LMQ, shows no immediate benefit after pre-training but leads to a 4.25% boost after SFT.

4

Front-loading reasoning provides a 3% gain even when the 'no-reason base' model receives twice the SFT compute, and a 12% average gain when a fixed total reasoning-data budget is split optimally between pre-training and post-training.

5

RLP training, which trains models to 'think' before predicting the next token using an information gain reward, outperforms a standard next-token prediction baseline by 19% on the Qwen 1.7B model, even with token-matched training.

6

RLP training on just 250 million tokens can achieve a 35% gain over a Nemotron Nano 12B baseline trained on 20 trillion tokens.

The four pillars of building state-of-the-art LLMs

The speaker outlines four critical components for building advanced Large Language Models (LLMs): smart data, employing high-quality, diverse datasets with effective filtering and deduplication; smart architecture, considering evolving designs like Mamba 2 and hybrid models beyond transformers; smart algorithms, focusing on optimized training techniques; and smart collaboration between research, engineering, and post-training teams. The work discussed centers on developing smart algorithms and optimizing data usage within these frameworks.

Maximizing data potential with a two-phase pre-training approach

To leverage vast datasets effectively, a two-phase pre-training strategy is proposed. Phase one emphasizes data diversity, exposing the model to a wide range of data, including medium- and low-quality web crawls alongside a smaller proportion of high-quality data. Phase two shifts focus exclusively to high-quality sources like math, code, and Wikipedia, training on them for a higher number of epochs. This approach, embodied by Volta, significantly outperforms Pascal, which trains on randomly ordered data, showing a 17% improvement on average. Volta also demonstrates a 3.4% gain over a model with an optimal data mixture but random ordering, highlighting the importance of structured data sequencing.
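The two-phase schedule amounts to switching the sampling weights over data sources at a phase boundary. The sketch below is a minimal illustration: the source names, proportions, and the two-phase split point are assumptions for demonstration, not the actual mixture from the talk.

```python
import random

def blend_for_phase(phase: int) -> dict[str, float]:
    """Sampling weights over data sources for each pre-training phase.
    The numbers here are illustrative assumptions, not the talk's blend."""
    if phase == 1:
        # Phase 1: diversity first -- mostly web crawl, some high quality.
        return {"web_crawl_low": 0.45, "web_crawl_med": 0.35,
                "math": 0.08, "code": 0.07, "wikipedia": 0.05}
    # Phase 2: high-quality sources only (math, code, Wikipedia),
    # repeated for more epochs.
    return {"math": 0.45, "code": 0.35, "wikipedia": 0.20}

def sample_source(phase: int, rng: random.Random) -> str:
    """Pick the data source for the next training batch."""
    weights = blend_for_phase(phase)
    names, probs = zip(*weights.items())
    return rng.choices(names, weights=probs, k=1)[0]
```

In a real run, `sample_source` would drive which shard the data loader reads next, with the switch from phase 1 to phase 2 triggered at a fixed token budget.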

Front-loading reasoning for a stronger foundation

The conventional LLM pipeline often treats reasoning as a post-hoc addition during Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL), leaving a weak foundation for reasoning. The proposed 'front-loading reasoning' approach integrates reasoning skill acquisition directly into the pre-training phase. This strategy, exemplified by Ampere, shows immediate benefits, outperforming Volta by 16% right after pre-training. Critically, these gains are not 'washed away' by subsequent SFT, with Ampere maintaining a 9.3% advantage. This suggests that early exposure to reasoning data builds a more robust and persistent capability, making the model more receptive to further refinement.

The impact of reasoning data quality and quantity

An analysis of reasoning data categorizes it by quality, quantity, and diversity. Datasets like SHQ are small in quantity/diversity but high in quality, while LDQ is large in quantity/diversity but low in quality. LMQ is a combination of both. While combining SHQ and LDQ (LMQ) showed no immediate gain over LDQ after pre-training, it yielded a 4.25% boost after SFT. This indicates that high-quality data, even if less abundant initially, can unlock latent potential in downstream tasks, and its benefits may emerge later in the training pipeline without causing overfitting. This underscores the nuanced interplay between data characteristics and their effectiveness at different training stages.

Durable advantages from early reasoning exposure

Front-loading reasoning creates a lasting advantage that is difficult to overcome with later-stage improvements. Even if a 'no-reason base' model receives twice the SFT compute (2x epochs), the 'reason base' model (Ampere) still achieves a 3% gain with just one epoch of SFT. Furthermore, when the total 'reasoning data budget' is fixed, splitting it between pre-training and post-training (reason base) results in a 12% average improvement over using all reasoning data solely for SFT (no-reason base). This conclusively demonstrates that pre-training without reasoning data cannot be compensated for by increased SFT effort or data allocation, emphasizing the foundational importance of early reasoning integration.

Reinforcement Learning as a Pre-training Objective (RLP)

The Reinforcement Learning as a Pre-training Objective (RLP) framework aims to teach models to 'learn by doing' rather than just 'learn by observing' text. Unlike standard next-token prediction, RLP has the model generate an explicit reasoning trace ('thought') before predicting the next token. The thought is sampled from a 'thought policy,' and its quality is scored with an information gain-based reward (log P_theta - log P_phi), where P_theta is the next-token probability conditioned on the thought and P_phi is the probability without it. This reward is dense and applies at every position, unlike the sparse, binary rewards used in other methods. A delayed 'no-think' baseline serves as the comparison and mitigates reward hacking. RLP shows significant improvements, outperforming standard next-token prediction by 19% on the Qwen 1.7B model and demonstrating remarkable data efficiency: 250 million RLP tokens boost the Nemotron Nano 12B model by 35% relative to a 20-trillion-token baseline.
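The information-gain reward described above fits in a few lines. In this hedged sketch, `p_with_thought` and `p_no_thought` stand for the thought policy's and the no-think baseline's probabilities of the observed next token; the surrounding RL machinery (sampling thoughts, advantage estimation, the delayed baseline update) is omitted.

```python
import math

def info_gain_reward(p_with_thought: float, p_no_thought: float) -> float:
    """RLP-style dense reward at a single token position:
    log P_theta(next token | context, thought) - log P_phi(next token | context).
    Positive when the generated thought made the observed next token more
    likely than the no-think baseline predicted; negative when it hurt."""
    return math.log(p_with_thought) - math.log(p_no_thought)

# A helpful thought raises the next token's probability (positive reward);
# an unhelpful one lowers it (negative reward).
r_helpful = info_gain_reward(0.40, 0.10)
r_harmful = info_gain_reward(0.05, 0.10)
```

Because this quantity is defined at every position of ordinary text, it provides a dense training signal without any external verifier.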

RLP's sustained performance and scalability

RLP's benefits are not limited to early pre-training checkpoints and are robust across model sizes and architectures. In experiments starting from an intermediate checkpoint of the Nemotron Nano 12B model, RLP achieved a 35% average gain over the 20-trillion-token baseline while itself consuming roughly 19.8 trillion ordinary tokens plus only 250 million RLP tokens. This indicates substantial token efficiency, especially in domains like science. Even after identical post-training (SFT and RL), RLP maintains a 3% absolute margin over the baseline. The results suggest that RLP's advantages persist, and potentially amplify, with larger models and more complex architectures, opening new avenues for scaling reasoning capabilities.

RLP's edge over prior reinforcement pre-training methods

RLP distinguishes itself from prior methods like Reinforcement Pretraining (RPT) and Reinforcement Learning on Pretraining Data (RLPT) through its verifier-free, intrinsic reward mechanism. While RPT and RLPT rely on external verifiers and sparse, binary rewards, RLP uses a dense, information gain-based reward. This allows RLP to reinforce reasoning steps at every position without needing a separate model for reward calculation. In a direct comparison using 170 million tokens, RLP outperformed RPT by 4%, attributed to RLP's ability to capture the full reasoning signal via its dense reward, whereas RPT's sparse rewards and selective application ignore intermediate reasoning steps. RLP also demonstrates gains (7-9%) even when applied to unannotated text streams like web crawl data, showing its versatility.
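The dense-versus-sparse contrast can be made concrete. This sketch is an illustration under stated assumptions, not the papers' actual implementations: the RLP-style reward scores every token position, while the RPT-style reward is reduced here to a single binary check on the final prediction, leaving intermediate reasoning positions without signal.

```python
import math

def dense_rewards(p_with: list[float], p_without: list[float]) -> list[float]:
    """RLP-style: an information-gain reward at every token position."""
    return [math.log(a) - math.log(b) for a, b in zip(p_with, p_without)]

def sparse_reward(predicted: str, gold: str) -> float:
    """RPT-style (simplified assumption): one binary verifier reward for
    the final prediction only -- intermediate steps get no signal."""
    return 1.0 if predicted == gold else 0.0

# Per-position next-token probabilities with and without a thought:
p_with = [0.5, 0.3, 0.6]
p_without = [0.2, 0.3, 0.1]
signal = dense_rewards(p_with, p_without)  # one reward per position
```

The dense variant returns as many rewards as there are positions, which is what lets RLP reinforce individual reasoning steps rather than only a final answer.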

Comparison of Learner Strategies (Pascal vs. Volta vs. Ampere vs. Hopper)

Data extracted from this episode

| Strategy Component | Pascal | Volta | Ampere | Hopper |
| --- | --- | --- | --- | --- |
| Curriculum (Data Order) | No | Yes | Yes | Yes |
| Front-loading Reasoning | No | No | Yes | Yes |
| Learning Through Thinking | No | No | No | Yes |
| Relative Improvement over Pascal | Baseline | >17% | Unknown | ~60% |

Two-Phase Pre-training vs. Baselines

Data extracted from this episode

| Approach | Average Improvement (%) |
| --- | --- |
| Volta vs. Optimal Blend, No Order | 3.4% |
| Volta vs. Pascal | 17.0% |
| Two-phase Pre-training (Quality, Epochs, Order) | Unknown |

Front-loading Reasoning: Gains After Pre-training and SFT

Data extracted from this episode

| Model Comparison | Gain After Pre-training (%) | Gain After SFT (%) |
| --- | --- | --- |
| Ampere (Reason Base) vs. Volta (No Reason Base) | 16% | 9.3% |

Impact of Reasoning Data Quality and Quantity (LMQ vs. LDQ)

Data extracted from this episode

| Dataset Comparison | Performance After Pre-training | Performance After SFT |
| --- | --- | --- |
| LMQ (SHQ + LDQ) vs. LDQ (Low Quality) | Similar performance | 4.25% boost for LMQ |

RLP vs. Next Token Prediction Baselines (Qwen 1.7B)

Data extracted from this episode

| Comparison Type | RLP Outperformance (%) |
| --- | --- |
| Token-Matched (1B tokens) | 19%; 17% |
| Flop-Matched (170M RLP tokens vs. 6B NTP tokens) | 14% |

RLP vs. Base Model (Nemotron Nano 12B V2)

Data extracted from this episode

| Comparison Type | Training Data (Total Tokens) | RLP Outperformance (%) |
| --- | --- | --- |
| Base Model vs. Base + RLP | Base: 20T; Base + RLP: ~19.8T + 250M RLP | 35% |

RLP vs. RPT Quantitative Comparison on OmniMath

Data extracted from this episode

| Technique | Average Improvement (%) |
| --- | --- |
| RPT | Baseline |
| RLP | +4% (over RPT) |

Common Questions

What are the key components of building a state-of-the-art LLM?

Building a SOTA LLM requires four key components: smart data (high-quality, diverse, and well-filtered), smart architecture (evolving models like transformers and Mamba 2), smart algorithms (advanced training recipes), and smart collaboration between different teams (pre-training, post-training, research, and engineering).

Topics

Mentioned in this video

Software & Apps
GPT-3

A large language model trained around 2021 on hundreds of billions of tokens, serving as a reference point for data consumption evolution.

RLPT

A related paper and technique for Reinforcement Learning on Pretraining Data, compared against RLP.

Qwen 1.7B

A base model used in RLP experiments, showing significant improvements after RLP training.

Essential Web

A dataset and classification system that goes beyond educational quality to include domain and level of education.

LDQ

A reasoning dataset characterized by large diversity and quantity but low quality.

LMQ

A combined dataset (concatenation of SHQ and LDQ) used to study the impact of high-quality data in pre-training.

Nemotron Nano 12B V2

A large hybrid Mamba 2 based model used for experiments showing RLP benefits at scale.

FineWeb-Edu

A popular approach for classifying web crawl data quality based on educational content.

Nemotron

A family of large language models developed at NVIDIA, on which the speaker was a lead contributor.

Open Thoughts

An open-source dataset mentioned in the context of categorizing data quality, noted for being a diverse but less filtered source compared to others.

NemoTron SFT dataset

A dataset used for Supervised Fine-Tuning, noted for its diversity across domains and less stringent filtering, affecting its quality rating.

RPT

A related paper and technique for Reinforcement Pretraining, compared against RLP, which uses an external verifier and a sparse reward.

OmniMath

A dataset used for quantitative comparison between RPT and RLP techniques.

NemoTron Cross Think

A dataset from NVIDIA geared towards reasoning beyond math and code, which became a trending dataset.

SHQ

A reasoning dataset characterized by small quantity and diversity but high quality.

Llama 3

A large language model trained in late 2024 on tens of trillions of tokens, illustrating the rapid increase in data consumption for LLMs.

NemoTron CC Math

A dataset containing common math examples derived from Common Crawl documents, developed at NVIDIA.

Prismatic Synthesis

A method developed at NVIDIA to encourage diversity in synthetic data generation.

GRPO

A popular framework for post-training LLMs, with similarities in advantage calculation to RLP's reward mechanism.

Mamba

An advanced architecture mentioned as an evolution from transformers in the field of LLMs.

NemoTron Nano-2

A dataset available on Hugging Face, contributed to by the speaker during their time at NVIDIA.
