DPO
Software / App
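Direct Preference Optimization, a simpler RLHF algorithm that eliminates the need for a separate reward model and on-policy sampling. It trains directly on pairs of preferred and dispreferred responses, taking gradient steps that raise the likelihood of the preferred response and lower the likelihood of the dispreferred one, each measured relative to a frozen reference model.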
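As a rough illustration of that update, here is a minimal sketch of the per-pair DPO loss in plain Python. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example, not taken from any particular library; the inputs are assumed to be summed log-probabilities of each full response under the trained policy and under a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))).

    All arguments are hypothetical names for this sketch. Each *_logp is the
    summed log-probability of a whole response; beta scales how strongly the
    policy is pushed away from the reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Positive margin means the policy already favors the preferred response.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Minimizing this loss increases the margin, which corresponds to stepping towards the preferred response and away from the dispreferred one. When the policy equals the reference model the margin is zero and the loss is log 2. In practice the same expression would be computed over batches with an autodiff framework rather than on scalar floats.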
Mentioned in 1 video
