Policy Optimization; traditional training approach using a large teacher model to critique data.
Two Minute Papers