GRPO
Concept
Mentioned in 4 videos
Group Relative Policy Optimization; a cheap, scalable reinforcement learning method that samples a group of answers from the student model for each prompt and scores each answer relative to the rest of the group, instead of relying on a separate teacher (critic) model to grade every step.
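
As a rough illustration of the "group relative" idea, the sketch below normalizes each answer's reward against the mean and standard deviation of its own group, which is the form of advantage usually described for GRPO. The function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the reward values are all assumptions made for this example; they do not come from any of the videos listed here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to its own group.

    rewards: one scalar reward per answer sampled for the same prompt.
    eps: small constant (an assumption) to avoid dividing by zero when
         every answer in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a choice made for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four answers sampled for one prompt
# (1.0 = judged correct, 0.0 = judged incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Answers above the group mean get a positive advantage and answers
# below it get a negative one.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.3f}")
```

Because the baseline is the group's own average, no separately trained value model is needed to decide which answers to reinforce. When every answer in a group gets the same reward, all advantages come out as zero and that group contributes no learning signal.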
Videos Mentioning GRPO

New DeepSeek Research - The Future Is Here!
Two Minute Papers

Why RL Won — Kyle Corbitt, OpenPipe (acq. CoreWeave)
Latent Space

⚡️Multi-Turn RL for Multi-Hour Agents — with Will Brown, Prime Intellect
Latent Space
A reinforcement learning algorithm that Will Brown has published work on, particularly regarding format rewards and the algorithm's use in multi-turn RL.

The #1 SWE-Bench Verified Agent
Latent Space
A variant of PPO that is currently popular in AI research, particularly for reinforcement learning on language models.