GRPO
Group Relative Policy Optimization; a cheap, scalable reinforcement-learning method that samples a group of answers per prompt and scores each one against the group's average reward, removing the need for a separate value (critic) model to grade every answer.
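The core of the group-relative idea can be shown in a few lines. This is a minimal sketch (not DeepSeek's implementation): each sampled answer's reward is normalized against the mean and standard deviation of its own group, and the result is used as the advantage in a PPO-style update. The function name and example reward values are illustrative.

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize each answer's reward against its own group:
    advantage = (reward - group mean) / group std.
    This replaces the learned value (critic) model in PPO."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

# Hypothetical rewards for 4 sampled answers to one prompt
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Answers above the group average get positive advantages and are reinforced; answers below it get negative advantages, all without a separate critic scoring each one.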
Videos Mentioning GRPO

New DeepSeek Research - The Future Is Here!
Two Minute Papers

⚡️Multi-Turn RL for Multi-Hour Agents — with Will Brown, Prime Intellect
Latent Space
A reinforcement learning algorithm that Will Brown has published work on, particularly in relation to format reward and its use in multi-turn RL.

The #1 SWE-Bench Verified Agent
Latent Space
A variation of PPO that is currently popular in AI research, particularly in the context of reinforcement learning for language models.

Why RL Won — Kyle Corbitt, OpenPipe (acq. CoreWeave)
Latent Space