How does RLVR differ from traditional RL approaches like AlphaGo?

AlphaGo optimizes directly for a win-loss condition, which is precisely defined. RLHF, by contrast, optimizes a learned reward model and can struggle with 'sloppiness' in how a good response is defined. RLVR aims to address this by focusing on verifiable problems that have clearer objectives.

What is PPO and why is its implementation considered tricky?

PPO (Proximal Policy Optimization) is a core RL algorithm. Its implementation is tricky due to numerous sensitive details, high variance in gradient estimates, and the complexity of components like advantage estimation and value functions.

What is GRPO and why is it favored over PPO for RLVR?

GRPO (Group Relative Policy Optimization) is a simpler algorithm derived from PPO that removes the complex value function. It uses a z-score advantage calculation within groups, making it easier to implement and more suitable for verifiable tasks.

What were the key contributions of the DeepSeek R1 paper?

DeepSeek R1 demonstrated that GRPO could effectively solve hard math problems and match OpenAI's performance, highlighting the simplicity and effectiveness of RLVR for open-source models.

How does the Kimi k1.5 approach differ from DeepSeek's RLVR?

Kimi k1.5 uses a DPO-inspired derivation but arrives at a GRPO-like objective. It also focuses on compressing response length to reduce inference costs, contrasting with GRPO's tendency to encourage longer outputs.

What are the main challenges in implementing RLVR for complex tasks like coding or math?

Challenges include data curation and curriculum generation, ensuring rewards are unhackable and robust, handling complex agent environments, and managing the difficulties of answer equivalence checking, especially in math.

How does Qwen Coder's approach to agentic RLVR differ?

Qwen Coder uses a mixture of thinking and non-thinking modes within a single model and incorporates agent environments at scale, like SWE-bench, to train software engineering agents.

What is the significance of robust rewards in RLVR?

Robust rewards are crucial because RL algorithms can exploit loopholes if rewards are not carefully designed. This can lead to models finding 'hacks' to achieve high scores without genuinely solving the problem, as seen in Git manipulation examples.

What are the core takeaways from the RLVR lecture?

The core takeaways are that reward robustness is paramount for RLVR, GRPO has enabled much of the recent progress, and while RL still has implementation challenges, it's become smoother than in the past.

Key Moments

Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 16: Post-Training - RLVR

Stanford Online

Education | 6 min read | 76 min video

May 27, 2026|8,156 views|134|5


TL;DR

Reinforcement Learning from Verifiable Rewards (RLVR) enables AI to solve complex problems like math and coding. GRPO, a simplified RL algorithm, has been key to its open-source adoption despite theoretical deviations from traditional RL.

Key Insights

RLHF faces an 'overoptimization' problem due to reward model overfitting, motivating the need for more robust RL methods.

PPO, while theoretically foundational for RL in language models, is notoriously difficult to implement correctly, with many implementation details significantly impacting performance.

GRPO (Group Relative Policy Optimization) simplifies PPO by removing the value function and using z-score normalization for advantage estimation, making it more accessible for open-source development.

DeepSeek R1 demonstrated the effectiveness of GRPO on math problems without process supervision, achieving performance close to OpenAI's models with a simpler recipe.

Kimi k1.5, while also beating OpenAI's models, used a DPO-inspired derivation that led to an objective similar to GRPO, suggesting convergence on effective RLVR techniques from different theoretical paths.

Agentic RLVR, as seen in Qwen Coder Next, heavily relies on extensive mid-training data collection and robust reward models to prevent 'hacking' by the agent, highlighting data quality and reward design as critical.

Limitations of RLHF and the need for RLVR

The lecture begins by addressing the limitations of Reinforcement Learning from Human Feedback (RLHF), particularly the issue of 'overoptimization' and annotation bottlenecks. While RLHF yields good results, like those seen in ChatGPT, it struggles with overfitting the reward model. This is contrasted with domains like the game of Go, where RL (e.g., AlphaGo) excels because the objective is precisely defined and verifiable. This distinction motivates the exploration of RLVR for tasks with verifiable outcomes, such as formal mathematics and coding, where reinforcement learning can be more naturally applied.

The complexities and sensitivities of PPO

The core algorithms for RL are introduced, starting with Proximal Policy Optimization (PPO). The lecture emphasizes that PPO is a workhorse in reinforcement learning and that its pseudocode may look simple, but real-world implementations are fraught with '37 implementation details' that are highly sensitive and can lead to vastly different results. Many common implementations have been found to be incorrect, fundamentally altering the optimization problem. The complexity arises from components like advantage estimation, experience replay, value function training, and the KL divergence term, which can operate token-by-token, making it a multi-step RL problem rather than a simple bandit problem. This sensitivity necessitates careful engineering and often leads to 'hacks' for stabilization, making PPO finicky and complicated for researchers implementing it from scratch. Furthermore, PPO requires a value model as large as the policy model, consuming significant memory.
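As a point of reference for the discussion above, the clipped policy term at the heart of PPO can be sketched in a few lines. This is a minimal illustration, not the lecture's code: the function names and the `clip_eps=0.2` default are assumptions, and the sketch deliberately omits the value loss, advantage estimation, and KL term where most of the sensitive details live.

```python
import math

def ppo_clipped_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-token clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    logp_new / logp_old are log-probabilities of the sampled token under the
    current policy and the policy that generated the rollout (hypothetical names).
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

def ppo_surrogate(logps_new, logps_old, advantages, clip_eps=0.2):
    """Average the clipped term over tokens; this quantity is maximized."""
    terms = [
        ppo_clipped_term(n, o, a, clip_eps)
        for n, o, a in zip(logps_new, logps_old, advantages)
    ]
    return sum(terms) / len(terms)
```

When the new and old policies agree (ratio of 1), the term reduces to the raw advantage. The clip only matters once the policy has moved away from the one that produced the rollouts.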

GRPO: A simpler, accessible RLVR alternative

To address the complexities of PPO, Group Relative Policy Optimization (GRPO) is presented as a more accessible algorithm for RLVR, particularly within the open-source community. GRPO simplifies PPO by removing the value function, a major source of complexity and instability. In its place, GRPO estimates the advantage from the z-score of rewards within a group of rollouts: (reward - mean) / standard deviation, where the mean and standard deviation are computed over the group. Each rollout's reward is thus compared against the other rollouts rather than against a learned value estimate. For the objective, it uses a clipped advantage (similar to PPO) and a KL term to remain close to a reference policy. In online settings, the clipping disappears, leaving a simple advantage minus a KL penalty. This makes GRPO easier to implement, suitable for a one-page implementation, and has led to its widespread adoption in open-source models. The DeepSeek math paper is highlighted as an early adopter demonstrating GRPO's effectiveness.
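The z-score advantage described above can be sketched directly. This is a minimal illustration under stated assumptions: the function name, the small `eps` added to the denominator for numerical safety, and the use of the population standard deviation are choices made here, not details confirmed by the lecture.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Z-score each rollout's reward against its own group.

    rewards: scalar rewards for several rollouts of the same prompt.
    eps: assumed stabilizer so an all-equal group does not divide by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four rollouts, two correct (reward 1) and two incorrect (reward 0).
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct rollouts receive a positive advantage and incorrect ones a negative advantage of equal size. If every rollout in the group gets the same reward, all advantages are zero and the group contributes no learning signal.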

DeepSeek R1: Demonstrating GRPO's effectiveness

DeepSeek R1 is presented as a significant milestone, being one of the first open-source models to match OpenAI's o1 in capabilities like long chains of thought and performance on hard math problems. A key aspect of DeepSeek R1 was its use of GRPO and its abandonment of process supervision in favor of outcome supervision (reward based solely on the final answer). The recipe was simple: a base model, followed by GRPO with accuracy and format rewards. Format rewards were used to ensure the model used thinking tags correctly and could later strip the chain of thought if desired. This straightforward approach, despite not using process supervision or elaborate RL techniques like MCTS, yielded impressive results. The paper also noted phenomena like increasing chain-of-thought length during training and 'aha' moments, though the lecture discusses these as potential side effects of GRPO's length normalization or pre-training artifacts rather than novel RL discoveries.
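An outcome-only reward combining accuracy and format checks might look like the sketch below. The tag names (`<think>`, `<answer>`), the exact-match comparison, and the unweighted sum are all assumptions made for illustration. A real math verifier would need proper answer-equivalence checking rather than string equality.

```python
import re

# Assumed response layout: reasoning in <think> tags, final answer in <answer> tags.
RESPONSE_PATTERN = re.compile(
    r"^<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)

def format_reward(response):
    """1.0 if the response follows the assumed tag layout, else 0.0."""
    return 1.0 if RESPONSE_PATTERN.match(response.strip()) else 0.0

def accuracy_reward(response, reference):
    """1.0 if the extracted final answer exactly matches the reference."""
    match = RESPONSE_PATTERN.match(response.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(2).strip() == reference.strip() else 0.0

def total_reward(response, reference):
    """Outcome-only reward: no credit is assigned to intermediate steps."""
    return accuracy_reward(response, reference) + format_reward(response)

def strip_thinking(response):
    """Return only the final answer, dropping the chain of thought."""
    match = RESPONSE_PATTERN.match(response.strip())
    return match.group(2).strip() if match else response
```

Because the format is enforced during training, the chain of thought can be removed at serving time with a simple pattern match, as `strip_thinking` does here.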

Kimi k1.5: A DPO-inspired path to similar RL effectiveness

Kimi k1.5 is discussed as another model that achieved strong results, even beating OpenAI's o1, yet it is often overlooked compared to DeepSeek. Kimi's approach offers a different perspective, starting with a DPO-inspired derivation and arriving at an objective that shares intuitions with GRPO. They frame the problem as matching a reward model and use a squared loss heuristic to achieve this, leading to an update rule that resembles GRPO with an added regularizer. A notable contribution from Kimi is their focus on data curation and curriculum generation for RL. They filter examples based on success rates and difficulty to optimize compute and learning, advocating for 'medium-range difficulty' problems. Unlike GRPO's length normalization, Kimi's objective does not normalize by sequence length and instead incorporates a heuristic length reward aiming to compress responses, balancing efficiency with avoiding unboundedly long incorrect outputs. They also highlight that RL methods generally outperform simpler 'expert iteration' baselines.
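The success-rate filtering described above can be illustrated with a short sketch. The function names and the 0.2 to 0.8 pass-rate band are hypothetical. The source only says that problems of 'medium-range difficulty' are preferred, not which thresholds were used.

```python
def pass_rate(outcomes):
    """Fraction of sampled attempts that were judged correct."""
    return sum(outcomes) / len(outcomes)

def filter_medium_difficulty(problem_outcomes, low=0.2, high=0.8):
    """Keep problems whose empirical pass rate falls inside [low, high].

    problem_outcomes: maps a problem id to a list of booleans, one per
    sampled attempt by the current model (assumed data layout).
    """
    kept = []
    for problem_id, outcomes in problem_outcomes.items():
        if not outcomes:
            continue  # no attempts recorded, so difficulty is unknown
        if low <= pass_rate(outcomes) <= high:
            kept.append(problem_id)
    return kept

# Example: only the problem the model solves about half the time is kept.
selected = filter_medium_difficulty({
    "easy": [True] * 8,
    "hard": [False] * 8,
    "medium": [True, False, True, False],
})
```

One reason such filtering saves compute: with group-relative advantages, a problem that the model always solves or always fails yields identical rewards across rollouts and therefore no gradient signal.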

Qwen 3 and Qwen Coder Next: Scaling and Agentic RLVR

The Qwen models (Qwen 3 and Qwen Coder Next) showcase sophisticated applications of RLVR. Qwen 3 follows a similar pipeline to DeepSeek: base model SFT, reasoning RL, RLHF, and distillation. It uses GRPO and employs extensive data filtering for difficulty and relevance. A unique aspect was their attempt to fuse 'thinking' and 'non-thinking' modes into a single model, with performance degrading gracefully upon early termination of the thinking process. Qwen Coder Next, focused on agentic capabilities, highlights the paramount importance of mid-training data collection. This phase includes processing code repositories, pull requests, and even running publicly available coding agents. A critical innovation is their robust reward design for agentic tasks; they specifically developed rewards to prevent agents from 'hacking' the Git history or other reward mechanisms. This reinforces the tenet that RLVR is only as robust as its reward models, which must be unhackable to prevent the agent from finding exploitative shortcuts. The ability to achieve high performance (e.g., 70.6% on SWE-bench) with a relatively small model (3 billion parameters) using extensive RL is a testament to the power of well-designed RLVR and data.
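To make the idea of a hack-resistant reward concrete, the sketch below refuses to grant reward when files the agent should not touch have changed. This is a hypothetical illustration of one possible guard, not the mechanism used for Qwen Coder Next. The function names and the notion of a 'protected' file set (for example, test files or repository metadata) are assumptions.

```python
import hashlib

def fingerprint(files):
    """Hash a mapping of path -> content in a stable, order-independent way."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode())
        digest.update(b"\0")
        digest.update(files[path].encode())
        digest.update(b"\0")
    return digest.hexdigest()

def guarded_reward(tests_passed, protected_before, protected_after):
    """Grant reward only if tests pass AND protected files are unchanged.

    protected_before / protected_after: snapshots of files the agent must
    not modify, taken before and after the episode (assumed setup).
    """
    if fingerprint(protected_before) != fingerprint(protected_after):
        return 0.0  # tampering detected: passing tests cannot be trusted
    return 1.0 if tests_passed else 0.0
```

A guard like this closes only one loophole. In practice each discovered exploit tends to require its own check, which is why reward design is treated as an ongoing engineering effort.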

The role of data and reward robustness in RLVR

Across all discussed models and algorithms, a recurring theme is the critical role of data and reward models. While algorithms like GRPO have made RLVR more accessible, the effectiveness hinges on having high-quality, well-curated data and, crucially, unhackable reward signals. The lecture points out that even in formal systems like the Lean compiler, adversarial robustness issues can arise. For agentic RLVR, reward hacking is a significant risk, necessitating careful design to ensure agents learn genuine task completion rather than exploiting loopholes in the reward mechanism. The DeepSeek R1 investigation into process supervision versus outcome supervision and Kimi's rigorous data filtering exemplify the iterative process of refining data collection and reward design. Ultimately, advances in RLVR are driven by balancing algorithmic innovation with practical considerations of data availability, computational efficiency, and the robustness of the reward signal.

Common Questions

RLHF faces limitations like annotation bottlenecks and overfitting of reward models, which can limit its potential for achieving optimal performance, especially in complex domains.

Topics

Reinforcement Learning AI & Machine Learning Technology & Innovation Language Models Deep Learning Model Training Algorithm Implementation Policy Gradients Verifiable Rewards

Mentioned in this video

Companies

OpenAI

Mentioned in the context of announcing a solved cold truth problem and for their early work with PPO on bots and the OpenAI gym.

DeepSeek

A company that developed the GRPO algorithm and published influential papers on RLVR, including the DeepSeek Math paper and the R1 model.

Software & Apps

ChatGPT

Mentioned as an example of what is achieved with instruction tuning and RLHF, and later as an example of an interface that allows early termination of thinking.

OpenAI Gym

Mentioned as a platform where PPO was used to train agents to walk and interact.

DeepSeek R1

A model developed by DeepSeek that effectively used GRPO for math problems, matching OpenAI's previous performance.

Qwen Coder

A model that demonstrates agentic RLVR training, using SWE-bench and other methods to achieve strong performance.

Lean

A formal math language compiler discussed as an example of a system that, despite seeming robust, has adversarial vulnerabilities.

Media

AlphaGo

Used as an example of successful reinforcement learning in domains with clear win-loss conditions.

Books

DeepSeek Math

A paper by DeepSeek that introduced the GRPO algorithm and demonstrated its effectiveness for math problems.
