Direct Preference Optimization
Concept
An RL-free approach that trains a model directly on human preference data (pairs of preferred and rejected responses) so that it favors the preferred outputs, without requiring a separate reward model.
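The core idea can be made concrete with the per-pair loss from the standard DPO formulation. Below is a minimal Python sketch. It assumes the summed log-probability of each full response has already been computed under both the trainable policy and a frozen reference model, and that `beta` is a temperature-like hyperparameter (0.1 is only an assumed default). The function names are hypothetical and do not come from any particular library.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # Split on the sign of z so math.exp never overflows.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default; tune per task
) -> float:
    """Hypothetical DPO loss for one (chosen, rejected) preference pair.

    Each argument is the summed log-probability of a whole response
    under the trainable policy or the frozen reference model.
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Implicit reward margin: how much more the policy (relative to the
    # reference) favors the chosen response over the rejected one.
    margin = beta * (chosen_logratio - rejected_logratio)
    # A plain logistic loss on the margin; no reward model or RL rollouts.
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log 2.
    print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
    # Policy shifted toward the chosen response: loss drops below log 2.
    print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

When the policy matches the reference, the loss is log 2 (about 0.693). Gradient descent lowers it by raising the policy's relative probability of the chosen response and lowering that of the rejected one, which is how the preference signal is learned without an explicit reward model.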