GRPO

ConceptMentioned in 4 videos

Group Relative Policy Optimization; a cheap, scalable training method that compares multiple student-generated answers to select the best one instead of grading every sentence by a separate teacher model.