Key Moments

Stanford CS25: Transformers United V6 | From Next-Token Prediction to Next-Generation Intelligence

Stanford Online
Education · 5 min read · 58 min video
May 11, 2026
TL;DR

Front-loading reasoning during LLM pre-training yields durable gains, outperforming models that only learn reasoning later, even with more fine-tuning.

Key Insights

1

Volta, which follows a two-phase pre-training approach with an optimal data mixture, is on average 17% better than Pascal, which trains on randomly ordered data.

2

Models that include reasoning data during pre-training (Ampere) show a 16% improvement right after pre-training compared to those that don't (Volta), and a 9.3% advantage persists even after Supervised Fine-Tuning (SFT).

3

High-quality reasoning data (SHQ), when combined with lower-quality data (LDQ) to form LMQ, shows no immediate benefit after pre-training but leads to a 4.25% boost after SFT.

4

Front-loading reasoning provides a 3% gain even when the 'no-reason base' model receives twice the SFT compute, and a 12% average gain when a fixed total reasoning-data budget is split optimally between pre-training and post-training.

5

RLP training, which trains models to 'think' before predicting the next token using an information gain reward, outperforms a standard next-token prediction baseline by 19% on the Qwen 1.7B model, even with token-matched training.

6

RLP training on just 250 million tokens can achieve a 35% gain over a Nemotron Nano 12B baseline trained on 20 trillion tokens.

The four pillars of building state-of-the-art LLMs

The speaker outlines four critical components for building advanced Large Language Models (LLMs): smart data, employing high-quality, diverse datasets with effective filtering and deduplication; smart architecture, considering evolving designs like Mamba 2 and hybrid models beyond transformers; smart algorithms, focusing on optimized training techniques; and smart collaboration between research, engineering, and post-training teams. The work discussed centers on developing smart algorithms and optimizing data usage within these frameworks.

Maximizing data potential with a two-phase pre-training approach

To leverage vast datasets effectively, a two-phase pre-training strategy is proposed. Phase one emphasizes data diversity, exposing the model to a wide range of data, including medium- and low-quality web crawls alongside a smaller proportion of high-quality data. Phase two shifts focus exclusively to high-quality sources like math, code, and Wikipedia, training on them for a higher number of epochs. This approach, embodied by Volta, significantly outperforms Pascal, which trains on randomly ordered data, showing a 17% improvement on average. Volta also demonstrates a 3.4% gain over a model with an optimal data mixture but random ordering, highlighting the importance of structured data sequencing.
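The two-phase schedule amounts to switching the sampling weights over data sources at a phase boundary. The sketch below is a minimal illustration: the source names, proportions, and the two-phase split point are assumptions for demonstration, not the actual mixture from the talk.

```python
import random

def blend_for_phase(phase: int) -> dict[str, float]:
    """Sampling weights over data sources for each pre-training phase.
    The numbers here are illustrative assumptions, not the talk's blend."""
    if phase == 1:
        # Phase 1: diversity first -- mostly web crawl, some high quality.
        return {"web_crawl_low": 0.45, "web_crawl_med": 0.35,
                "math": 0.08, "code": 0.07, "wikipedia": 0.05}
    # Phase 2: high-quality sources only (math, code, Wikipedia),
    # repeated for more epochs.
    return {"math": 0.45, "code": 0.35, "wikipedia": 0.20}

def sample_source(phase: int, rng: random.Random) -> str:
    """Pick the data source for the next training batch."""
    weights = blend_for_phase(phase)
    names, probs = zip(*weights.items())
    return rng.choices(names, weights=probs, k=1)[0]
```

In a real run, `sample_source` would drive which shard the data loader reads next, with the switch from phase 1 to phase 2 triggered at a fixed token budget.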

Front-loading reasoning for a stronger foundation

The conventional LLM pipeline often treats reasoning as a post-hoc addition during Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL), leaving a weak foundation for reasoning. The proposed 'front-loading reasoning' approach integrates reasoning skill acquisition directly into the pre-training phase. This strategy, exemplified by Ampere, shows immediate benefits, outperforming Volta by 16% right after pre-training. Critically, these gains are not 'washed away' by subsequent SFT, with Ampere maintaining a 9.3% advantage. This suggests that early exposure to reasoning data builds a more robust and persistent capability, making the model more receptive to further refinement.

The impact of reasoning data quality and quantity

An analysis of reasoning data categorizes it by quality, quantity, and diversity. Datasets like SHQ are small in quantity/diversity but high in quality, while LDQ is large in quantity/diversity but low in quality. LMQ is a combination of both. While combining SHQ and LDQ (LMQ) showed no immediate gain over LDQ after pre-training, it yielded a 4.25% boost after SFT. This indicates that high-quality data, even if less abundant initially, can unlock latent potential in downstream tasks, and its benefits may emerge later in the training pipeline without causing overfitting. This underscores the nuanced interplay between data characteristics and their effectiveness at different training stages.

Durable advantages from early reasoning exposure

Front-loading reasoning creates a lasting advantage that is difficult to overcome with later-stage improvements. Even if a 'no-reason base' model receives twice the SFT compute (2x epochs), the 'reason base' model (Ampere) still achieves a 3% gain with just one epoch of SFT. Furthermore, when the total 'reasoning data budget' is fixed, splitting it between pre-training and post-training (reason base) results in a 12% average improvement over using all reasoning data solely for SFT (no-reason base). This conclusively demonstrates that pre-training without reasoning data cannot be compensated for by increased SFT effort or data allocation, emphasizing the foundational importance of early reasoning integration.

Reinforcement Learning as a Pre-training Objective (RLP)

The Reinforcement Learning as a Pre-training Objective (RLP) framework aims to teach models to 'learn by doing' rather than just 'learn by observing' text. Unlike standard next-token prediction, RLP has the model generate an explicit reasoning trace ('thought') before predicting the next token. The thought is sampled from a 'thought policy,' and its quality is scored with an information gain-based reward (log P_theta - log P_phi), where P_theta is the next-token probability conditioned on the thought and P_phi is the probability without it. This reward is dense and applies at every position, unlike the sparse, binary rewards used in other methods. A delayed 'no-think' baseline serves as the comparison and mitigates reward hacking. RLP shows significant improvements, outperforming standard next-token prediction by 19% on the Qwen 1.7B model and demonstrating remarkable data efficiency: 250 million RLP tokens boost the Nemotron Nano 12B model by 35% relative to a 20-trillion-token baseline.
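The information-gain reward described above fits in a few lines. In this hedged sketch, `p_with_thought` and `p_no_thought` stand for the thought policy's and the no-think baseline's probabilities of the observed next token; the surrounding RL machinery (sampling thoughts, advantage estimation, the delayed baseline update) is omitted.

```python
import math

def info_gain_reward(p_with_thought: float, p_no_thought: float) -> float:
    """RLP-style dense reward at a single token position:
    log P_theta(next token | context, thought) - log P_phi(next token | context).
    Positive when the generated thought made the observed next token more
    likely than the no-think baseline predicted; negative when it hurt."""
    return math.log(p_with_thought) - math.log(p_no_thought)

# A helpful thought raises the next token's probability (positive reward);
# an unhelpful one lowers it (negative reward).
r_helpful = info_gain_reward(0.40, 0.10)
r_harmful = info_gain_reward(0.05, 0.10)
```

Because this quantity is defined at every position of ordinary text, it provides a dense training signal without any external verifier.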

RLP's sustained performance and scalability

RLP's benefits are not limited to early pre-training checkpoints and are robust across model sizes and architectures. In experiments starting from an intermediate checkpoint of the Nemotron Nano 12B model, RLP achieved a 35% average gain over the 20-trillion-token baseline while itself consuming roughly 19.8 trillion ordinary tokens plus only 250 million RLP tokens. This indicates substantial token efficiency, especially in domains like science. Even after identical post-training (SFT and RL), RLP maintains a 3% absolute margin over the baseline. The results suggest that RLP's advantages persist, and potentially amplify, with larger models and more complex architectures, opening new avenues for scaling reasoning capabilities.

RLP's edge over prior reinforcement pre-training methods

RLP distinguishes itself from prior methods like Reinforcement Pretraining (RPT) and Reinforcement Learning on Pretraining Data (RLPT) through its verifier-free, intrinsic reward mechanism. While RPT and RLPT rely on external verifiers and sparse, binary rewards, RLP uses a dense, information gain-based reward. This allows RLP to reinforce reasoning steps at every position without needing a separate model for reward calculation. In a direct comparison using 170 million tokens, RLP outperformed RPT by 4%, attributed to RLP's ability to capture the full reasoning signal via its dense reward, whereas RPT's sparse rewards and selective application ignore intermediate reasoning steps. RLP also demonstrates gains (7-9%) even when applied to unannotated text streams like web crawl data, showing its versatility.
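The dense-versus-sparse contrast can be made concrete. This sketch is an illustration under stated assumptions, not the papers' actual implementations: the RLP-style reward scores every token position, while the RPT-style reward is reduced here to a single binary check on the final prediction, leaving intermediate reasoning positions without signal.

```python
import math

def dense_rewards(p_with: list[float], p_without: list[float]) -> list[float]:
    """RLP-style: an information-gain reward at every token position."""
    return [math.log(a) - math.log(b) for a, b in zip(p_with, p_without)]

def sparse_reward(predicted: str, gold: str) -> float:
    """RPT-style (simplified assumption): one binary verifier reward for
    the final prediction only -- intermediate steps get no signal."""
    return 1.0 if predicted == gold else 0.0

# Per-position next-token probabilities with and without a thought:
p_with = [0.5, 0.3, 0.6]
p_without = [0.2, 0.3, 0.1]
signal = dense_rewards(p_with, p_without)  # one reward per position
```

The dense variant returns as many rewards as there are positions, which is what lets RLP reinforce individual reasoning steps rather than only a final answer.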

Comparison of Learner Strategies (Pascal vs. Volta vs. Ampere vs. Hopper)

Data extracted from this episode

| Strategy Component | Pascal | Volta | Ampere | Hopper |
| --- | --- | --- | --- | --- |
| Curriculum (Data Order) | No | Yes | Yes | Yes |
| Front-loading Reasoning | No | No | Yes | Yes |
| Learning Through Thinking | No | No | No | Yes |
| Relative Improvement over Pascal | Baseline | >17% | Unknown | ~60% |

Two-Phase Pre-training vs. Baselines

Data extracted from this episode

| Approach | Average Improvement (%) |
| --- | --- |
| Volta vs. Optimal Blend, No Order | 3.4% |
| Volta vs. Pascal | 17.0% |
| Two-phase Pre-training (Quality, Epochs, Order) | Unknown |

Front-loading Reasoning: Gains After Pre-training and SFT

Data extracted from this episode

| Model Comparison | Gain After Pre-training (%) | Gain After SFT (%) |
| --- | --- | --- |
| Ampere (Reason Base) vs. Volta (No Reason Base) | 16% | 9.3% |

Impact of Reasoning Data Quality and Quantity (LMQ vs. LDQ)

Data extracted from this episode

| Dataset Comparison | Performance After Pre-training | Performance After SFT |
| --- | --- | --- |
| LMQ (SHQ + LDQ) vs. LDQ (Low Quality) | Similar performance | 4.25% boost for LMQ |

RLP vs. Next Token Prediction Baselines (Qwen 1.7B)

Data extracted from this episode

| Comparison Type | RLP Outperformance (%) |
| --- | --- |
| Token-Matched (1B tokens) | 19%; 17% |
| Flop-Matched (170M RLP tokens vs. 6B NTP tokens) | 14% |

RLP vs. Base Model (Nemotron Nano 12B V2)

Data extracted from this episode

| Comparison Type | Training Data (Total Tokens) | RLP Outperformance (%) |
| --- | --- | --- |
| Base Model vs. Base + RLP | Base: 20T; Base + RLP: ~19.8T + 250M RLP | 35% |

RLP vs. RPT Quantitative Comparison on OmniMath

Data extracted from this episode

| Technique | Average Improvement (%) |
| --- | --- |
| RPT | Baseline |
| RLP | +4% (over RPT) |

Common Questions

What are the key components of building a state-of-the-art LLM?

Building a SOTA LLM requires four key components: smart data (high-quality, diverse, and well-filtered), smart architecture (evolving models like transformers and Mamba 2), smart algorithms (advanced training recipes), and smart collaboration between different teams (pre-training, post-training, research, and engineering).

Topics

Mentioned in this video

Software & Apps
GPT-3

A large language model trained around 2021 on hundreds of billions of tokens, serving as a reference point for data consumption evolution.

RLPT

A related paper and technique for Reinforcement Learning on Pretraining Data, compared against RLP.

Qwen 1.7B

A base model used in RLP experiments, showing significant improvements after RLP training.

Essential Web

A dataset and classification system that goes beyond educational quality to include domain and level of education.

LDQ

A reasoning dataset characterized by large diversity and quantity but low quality.

LMQ

A combined dataset (concatenation of SHQ and LDQ) used to study the impact of high-quality data in pre-training.

Nemotron Nano 12B V2

A large hybrid Mamba 2 based model used for experiments showing RLP benefits at scale.

FineWeb-Edu

A popular approach for classifying web crawl data quality based on educational content.

Nemotron

A family of large language models developed at NVIDIA, on which the speaker was a lead contributor.

Open Thoughts

An open-source dataset mentioned in the context of categorizing data quality, noted for being a diverse but less filtered source compared to others.

NemoTron SFT dataset

A dataset used for Supervised Fine-Tuning, noted for its diversity across domains and less stringent filtering, affecting its quality rating.

RPT

A related paper and technique for Reinforcement Pretraining, compared against RLP, which uses an external verifier and a sparse reward.

OmniMath

A dataset used for quantitative comparison between RPT and RLP techniques.

NemoTron Cross Think

A dataset from NVIDIA geared towards reasoning beyond math and code, which became a trending dataset.

SHQ

A reasoning dataset characterized by small quantity and diversity but high quality.

Llama 3

A large language model trained in late 2024 on tens of trillions of tokens, illustrating the rapid increase in data consumption for LLMs.

NemoTron CC Math

A dataset containing common math examples derived from Common Crawl documents, developed at NVIDIA.

Prismatic Synthesis

A method developed at NVIDIA to encourage diversity in synthetic data generation.

GRPO

A popular framework for post-training LLMs, with similarities in advantage calculation to RLP's reward mechanism.

Mamba

An advanced architecture mentioned as an evolution from transformers in the field of LLMs.

NemoTron Nano-2

A dataset available on Hugging Face, contributed to by the speaker during their time at NVIDIA.
