GRPO
Group Relative Policy Optimization; a cheap, scalable reinforcement-learning method that samples a group of answers per prompt and scores each one against the group's average reward, removing the need for a separate value (critic) model to grade every answer.
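The core of the group-relative idea can be shown in a few lines. This is a minimal sketch (not DeepSeek's implementation): each sampled answer's reward is normalized against the mean and standard deviation of its own group, and the result is used as the advantage in a PPO-style update. The function name and example reward values are illustrative.

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize each answer's reward against its own group:
    advantage = (reward - group mean) / group std.
    This replaces the learned value (critic) model in PPO."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

# Hypothetical rewards for 4 sampled answers to one prompt
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Answers above the group average get positive advantages and are reinforced; answers below it get negative advantages, all without a separate critic scoring each one.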
Videos Mentioning GRPO

New DeepSeek Research - The Future Is Here!
Two Minute Papers

⚡️Multi-Turn RL for Multi-Hour Agents — with Will Brown, Prime Intellect
Latent Space
A reinforcement learning algorithm that Will Brown has published work on, particularly in relation to format reward and its use in multi-turn RL.

The #1 SWE-Bench Verified Agent
Latent Space
A variation of PPO that is currently popular in AI research, particularly in the context of reinforcement learning for language models.

Why RL Won — Kyle Corbitt, OpenPipe (acq. CoreWeave)
Latent Space