PO
Concept
Policy Optimization; traditional training approach using a large teacher model to critique data.
Mentioned in 2 videos
Videos Mentioning PO

New DeepSeek Research - The Future Is Here!
Two Minute Papers
Policy Optimization; traditional training approach using a large teacher model to critique data.

⚡️Multi-Turn RL for Multi-Hour Agents — with Will Brown, Prime Intellect
Latent Space
An older reinforcement learning algorithm, mentioned as the basis for RLHF and contrasted with GRPO in a discussion of memory efficiency and gradient syncing.