Key Moments
Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 16: Post-Training - RLVR
Reinforcement Learning from Verifiable Rewards (RLVR) lets language models tackle problems with checkable answers, such as math and coding. GRPO, a simplified RL algorithm, has been key to open-source adoption even though it departs from traditional RL theory.
Key Insights
RLHF faces an 'overoptimization' problem due to reward model overfitting, motivating the need for more robust RL methods.
PPO, while theoretically foundational for RL in language models, is notoriously difficult to implement correctly, with many implementation details significantly impacting performance.
GRPO (Group Relative Policy Optimization) simplifies PPO by removing the value function and estimating the advantage with a z-score of rewards within a group of rollouts, making it more accessible for open-source development.
DeepSeek R1 demonstrated the effectiveness of GRPO on math problems without process supervision, achieving performance close to OpenAI's models with a simpler recipe.
Kimi k1.5, which also beat OpenAI's models, used a DPO-inspired derivation that led to an objective similar to GRPO, suggesting that different theoretical paths converge on the same effective RLVR techniques.
Agentic RLVR, as seen in Qwen Coder Next, relies heavily on extensive mid-training data collection and robust reward models that prevent 'hacking' by the agent, which makes data quality and reward design critical.
Limitations of RLHF and the need for RLVR
The lecture begins by addressing the limitations of Reinforcement Learning from Human Feedback (RLHF), particularly the issue of 'overoptimization' and annotation bottlenecks. While RLHF yields good results, like those seen in ChatGPT, it struggles with overfitting the reward model. This is contrasted with domains like the game of Go, where RL (e.g., AlphaGo) excels because the objective is precisely defined and verifiable. This distinction motivates the exploration of RLVR for tasks with verifiable outcomes, such as formal mathematics and coding, where reinforcement learning can be more naturally applied.
The complexities and sensitivities of PPO
The lecture then introduces the core RL algorithms, starting with Proximal Policy Optimization (PPO). PPO is a workhorse of reinforcement learning and its pseudocode looks simple, but real-world implementations depend on '37 implementation details' that are highly sensitive and can produce vastly different results. Many common implementations have been found to be incorrect in ways that fundamentally alter the optimization problem. The complexity comes from components such as advantage estimation, experience replay, value function training, and the KL divergence term. Because the KL term can operate token by token, the setup becomes a multi-step RL problem rather than a simple bandit problem. This sensitivity demands careful engineering and often leads to stabilization 'hacks', which makes PPO finicky for researchers implementing it from scratch. PPO also requires a value model as large as the policy model, which consumes significant memory.
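To make the clipping idea concrete, here is a minimal sketch of PPO's clipped surrogate loss over a handful of tokens. It is written in plain Python for readability; the function name `ppo_clipped_loss` and the default `clip_eps=0.2` are illustrative assumptions, and a real implementation would operate on tensors with autograd and add the value loss, KL term, and the many details the lecture warns about.

```python
import math


def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Token-averaged PPO clipped surrogate loss (a quantity to minimize).

    logp_new / logp_old: per-token log-probabilities under the current
    policy and the policy that generated the rollout.
    advantages: per-token advantage estimates (assumed given here).
    clip_eps: clipping range; 0.2 is an illustrative default.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Importance ratio between the current and the rollout policy.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic choice: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped_ratio * adv)
    # Negate because optimizers minimize, while the surrogate is maximized.
    return -total / len(advantages)


if __name__ == "__main__":
    # When the two policies agree, the ratio is 1 and clipping is inactive.
    print(ppo_clipped_loss([-1.0, -2.0], [-1.0, -2.0], [2.0, 1.0]))
```

When the current policy equals the rollout policy, every ratio is 1 and the loss is just the negated mean advantage, which is the case in which clipping has no effect.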
GRPO: A simpler, accessible RLVR alternative
To address the complexities of PPO, Group Relative Policy Optimization (GRPO) is presented as a more accessible algorithm for RLVR, particularly within the open-source community. GRPO simplifies PPO by removing the value function, a major source of complexity and instability. In its place, GRPO estimates the advantage of each rollout as the z-score of its reward within a group of rollouts sampled for the same prompt: (reward - mean) / standard deviation. The objective combines a clipped, advantage-weighted term (as in PPO) with a KL term that keeps the policy close to a reference policy. In the purely online setting, where each batch of rollouts is used for a single gradient step, the clipping has no effect and the objective reduces to the advantage term minus a KL penalty. This makes GRPO easy to implement, roughly a one-page implementation, and has led to its widespread adoption in open-source models. The DeepSeekMath paper is highlighted as an early adopter demonstrating GRPO's effectiveness.
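The z-score advantage described above fits in a few lines. The sketch below assumes one group of scalar rewards for rollouts of the same prompt; the function name `group_advantages`, the use of the population standard deviation, and the small `eps` added for numerical safety are assumptions made for illustration, and implementations differ on these choices.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Z-score each rollout's reward against its own group.

    rewards: scalar rewards for rollouts sampled from the same prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; sample std is also plausible
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Two correct and two incorrect rollouts for one prompt.
    print(group_advantages([1.0, 0.0, 0.0, 1.0]))
    # A group where every rollout succeeds carries no learning signal.
    print(group_advantages([1.0, 1.0, 1.0]))
```

A group in which every rollout gets the same reward yields all-zero advantages, so prompts that are always solved or never solved contribute no gradient.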
DeepSeek R1: Demonstrating GRPO's effectiveness
DeepSeek R1 is presented as a significant milestone: one of the first open-source models to match OpenAI's o1 in capabilities such as long chains of thought and performance on hard math problems. A key aspect of DeepSeek R1 was its use of GRPO and its abandonment of process supervision in favor of outcome supervision (reward based solely on the final answer). The recipe was simple: a base model, followed by GRPO with accuracy and format rewards. Format rewards ensured that the model used thinking tags correctly, so that the chain of thought could later be stripped if desired. This straightforward approach, which used neither process supervision nor elaborate RL techniques like MCTS, yielded impressive results. The paper also noted phenomena such as increasing chain-of-thought length during training and 'aha' moments, though the lecture discusses these as potential side effects of GRPO's length normalization or pre-training artifacts rather than novel RL discoveries.
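As a rough illustration of outcome supervision with accuracy and format rewards, the sketch below scores a completion only on its final answer and on whether it wraps its reasoning in thinking tags. The `<think>` tag name, exact-match grading, and summing the two rewards with equal weight are all assumptions for this example; the summary does not specify how DeepSeek graded answers or weighted the rewards.

```python
import re

# Assumed layout: reasoning inside <think>...</think>, then the final answer.
THINK_PATTERN = re.compile(r"^<think>(.*?)</think>(.*)$", re.DOTALL)


def format_reward(completion):
    """1.0 if the completion follows the assumed thinking-tag layout."""
    return 1.0 if THINK_PATTERN.match(completion.strip()) else 0.0


def strip_chain_of_thought(completion):
    """Return only the text after the thinking block, if one is present."""
    match = THINK_PATTERN.match(completion.strip())
    return match.group(2).strip() if match else completion.strip()


def accuracy_reward(completion, reference_answer):
    """Outcome-only reward: compare the final answer, ignore the reasoning.

    Exact string match is a simplification; real graders normalize
    mathematical expressions before comparing.
    """
    final = strip_chain_of_thought(completion)
    return 1.0 if final == reference_answer.strip() else 0.0


def total_reward(completion, reference_answer):
    """Equal-weight sum of the two rewards (the weighting is an assumption)."""
    return accuracy_reward(completion, reference_answer) + format_reward(completion)


if __name__ == "__main__":
    print(total_reward("<think>2 + 2 is 4</think>4", "4"))
    print(total_reward("4", "4"))
```

Because the same pattern that checks the format also separates reasoning from answer, stripping the chain of thought at inference time reuses the structure the format reward enforced during training.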
Kimi k1.5: A DPO-inspired path to similar RL effectiveness
Kimi k1.5 is discussed as another model that achieved strong results, even beating OpenAI's o1, yet is often overlooked compared to DeepSeek. Kimi's approach offers a different perspective: it starts from a DPO-inspired derivation and arrives at an objective that shares intuitions with GRPO. The authors frame the problem as matching a reward model and use a squared-loss heuristic to do so, which leads to an update rule that resembles GRPO with an added regularizer. A notable contribution from Kimi is the focus on data curation and curriculum generation for RL. They filter examples by success rate and difficulty to make better use of compute, advocating for problems of 'medium-range difficulty'. Unlike GRPO, Kimi's objective does not normalize by sequence length. It instead incorporates a heuristic length reward that aims to compress responses, balancing efficiency against the risk of unboundedly long incorrect outputs. They also report that RL methods generally outperform simpler 'expert iteration' baselines.
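The two data-side ideas in this section, filtering for medium-difficulty problems and rewarding shorter responses, can be sketched as follows. The thresholds `low=0.2` and `high=0.8`, the function names, and the specific linear length reward are assumptions chosen for illustration; the summary does not give the values or the exact formula Kimi used.

```python
def success_rate(outcomes):
    """Fraction of sampled attempts that solved the problem."""
    return sum(outcomes) / len(outcomes)


def keep_medium_difficulty(problem_outcomes, low=0.2, high=0.8):
    """Keep problem ids whose success rate falls in [low, high].

    problem_outcomes: dict mapping a problem id to a list of 0/1 outcomes
    from sampling the current model. Thresholds are illustrative.
    """
    return [
        pid
        for pid, outcomes in problem_outcomes.items()
        if outcomes and low <= success_rate(outcomes) <= high
    ]


def length_rewards(lengths, correct):
    """Hypothetical length reward computed within one group of responses.

    Shorter responses score higher (from +0.5 down to -0.5). An incorrect
    response never earns a positive length bonus, so being short and wrong
    is not rewarded, while being long and wrong is penalized.
    """
    lo, hi = min(lengths), max(lengths)
    if hi == lo:
        return [0.0] * len(lengths)
    rewards = []
    for n, ok in zip(lengths, correct):
        lam = 0.5 - (n - lo) / (hi - lo)
        rewards.append(lam if ok else min(0.0, lam))
    return rewards


if __name__ == "__main__":
    pool = {"easy": [1, 1, 1, 1], "medium": [1, 0, 1, 0], "hard": [0, 0, 0, 0]}
    print(keep_medium_difficulty(pool))
    print(length_rewards([100, 200, 300], [True, True, False]))
```

Dropping always-solved and never-solved problems saves rollouts that would carry little learning signal, which is one way to read the 'medium-range difficulty' recommendation.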
Qwen3 and Qwen Coder Next: Scaling and Agentic RLVR
The Qwen models (Qwen3 and Qwen Coder Next) showcase sophisticated applications of RLVR. Qwen3 follows a pipeline similar to DeepSeek's: SFT on a base model, then reasoning RL, RLHF, and distillation. It uses GRPO and applies extensive data filtering for difficulty and relevance. A unique aspect was the attempt to fuse 'thinking' and 'non-thinking' modes into a single model, with performance degrading gracefully when the thinking process is terminated early. Qwen Coder Next, which focuses on agentic capabilities, highlights the importance of mid-training data collection. This phase includes processing code repositories and pull requests, and even running publicly available coding agents. A critical innovation is the robust reward design for agentic tasks: the team specifically developed rewards that prevent agents from 'hacking' the Git history or other reward mechanisms. This reinforces the tenet that RLVR is only as robust as its reward signal, which must resist hacking so that the agent cannot find exploitative shortcuts. Achieving high performance (e.g., 70.6% on SWE-bench) with a relatively small model (3 billion parameters) through extensive RL is a testament to the power of well-designed RLVR and data.
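One simple way to picture a hack-resistant reward for a coding agent is a guard that refuses to pay out when the agent has modified files it should not touch. The sketch below is purely hypothetical and is not the reward design described in the lecture: the protected locations (`.git` and `tests`), the function names, and the binary reward are all assumptions for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical protected locations: repository history and the test suite.
PROTECTED_PREFIXES = (".git", "tests")


def touches_protected_path(modified_paths, protected=PROTECTED_PREFIXES):
    """True if any modified path lives under a protected top-level directory."""
    for raw in modified_paths:
        parts = PurePosixPath(raw).parts
        if parts and parts[0] in protected:
            return True
    return False


def guarded_reward(tests_passed, modified_paths):
    """Binary task reward that is withheld when protected files were changed.

    tests_passed: whether the verification tests passed after the agent ran.
    modified_paths: repository-relative paths the agent changed.
    """
    if touches_protected_path(modified_paths):
        return 0.0  # passing tests by editing them or the history earns nothing
    return 1.0 if tests_passed else 0.0


if __name__ == "__main__":
    print(guarded_reward(True, ["src/app.py"]))
    print(guarded_reward(True, ["src/app.py", ".git/HEAD"]))
```

A path-based guard like this only closes the loopholes someone thought of in advance, which is why the lecture treats reward robustness as an ongoing design problem rather than a solved one.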
The role of data and reward robustness in RLVR
Across all the models and algorithms discussed, a recurring theme is the critical role of data and reward models. While algorithms like GRPO have made RLVR more accessible, its effectiveness hinges on high-quality, well-curated data and, crucially, reward signals that resist hacking. The lecture points out that adversarial robustness issues can arise even in formal systems like the Lean compiler. For agentic RLVR, reward hacking is a significant risk, so rewards must be designed carefully to ensure that agents learn genuine task completion rather than exploiting loopholes in the reward mechanism. DeepSeek R1's investigation of process supervision versus outcome supervision and Kimi's rigorous data filtering exemplify the iterative process of refining data collection and reward design. Ultimately, advances in RLVR are driven by balancing algorithmic innovation with practical considerations of data availability, computational efficiency, and the robustness of the reward signal.
Common Questions
RLHF faces limitations like annotation bottlenecks and overfitting of reward models, which can limit its potential for achieving optimal performance, especially in complex domains.
Mentioned in this video
Mentioned as an example of what is achieved with instruction tuning and RLHF, and later as an example of an interface that allows early termination of thinking.
Mentioned as a platform where PPO was used to train agents to walk and interact.
A model developed by DeepSeek that effectively used GRPO for math problems, matching OpenAI's previous performance.
A model that demonstrates agentic RLVR training, evaluated on SWE-bench among other benchmarks, and achieving strong performance.
A formal math language compiler discussed as an example of a system that, despite seeming robust, has adversarial vulnerabilities.