GRPO

Concept

Group Relative Policy Optimization; a cheap, scalable reinforcement learning method that samples several answers to the same prompt and scores each one against the group's average, instead of relying on a separate critic model to grade every token.
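
The group comparison at the heart of GRPO can be sketched in a few lines. This is a minimal illustration, not a training loop: the function name `group_relative_advantages`, the `eps` parameter, and the sample rewards are hypothetical, and using the population standard deviation is an assumption made here for simplicity.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to its own group.

    rewards: one scalar reward per answer sampled for the same prompt.
    eps: small constant (an assumption here) to avoid division by zero
         when every answer in the group receives the same reward.
    """
    mu = mean(rewards)       # the group average acts as the baseline
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four answers to one prompt (1.0 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Answers that beat the group average get a positive advantage and are reinforced; those below it get a negative one. If every answer in a group earns the same reward, all advantages are zero and that group contributes no learning signal.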

Mentioned in 4 videos

Videos Mentioning GRPO

New DeepSeek Research - The Future Is Here!

Two Minute Papers

⚡️Multi-Turn RL for Multi-Hour Agents — with Will Brown, Prime Intellect

Latent Space

A reinforcement learning algorithm that Will Brown has published work on, particularly in relation to format reward and its use in multi-turn RL.

The #1 SWE-Bench Verified Agent

Latent Space

A variation of PPO that is currently popular in AI research, particularly in the context of reinforcement learning for models.

Why RL Won — Kyle Corbitt, OpenPipe (acq. CoreWeave)

Latent Space