DPO

Software / App

Direct Preference Optimization (DPO) is a simpler alternative to reward-model-based RLHF: it eliminates the need for a separate reward model and for on-policy sampling. It trains directly on preference pairs, taking gradient steps that raise the likelihood of the preferred response and lower the likelihood of the dispreferred one.
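To make the "towards preferred, away from dispreferred" description concrete, here is a minimal sketch of the per-pair DPO loss in its standard published form, which is not necessarily the exact form used in the lecture. It assumes the summed log-probabilities of each response under the trained policy and under a frozen reference model have already been computed. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative choices, not taken from the source.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    All inputs are sequence log-probabilities (floats). `beta` is an
    assumed temperature that scales how strongly the policy may deviate
    from the reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Positive margin: the policy favors the preferred response more
    # strongly than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(m)) == log(1 + exp(-m)), computed in a stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Minimizing this loss increases the log-probability of the preferred response and decreases that of the dispreferred one, without ever sampling from the policy or fitting a separate reward model. When the policy equals the reference, the margin is zero and the loss is log 2.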

Mentioned in 1 video

Videos Mentioning DPO

Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 15: Mid/Post-Training

Stanford Online
