Direct Preference Optimization

Concept

An RL-free fine-tuning approach that trains a model directly on human preference data, so that it favors the response people preferred over the one they rejected, without requiring a separate reward model.
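
As a rough illustration of how the preference signal can be turned into a training loss without a reward model, here is a minimal sketch of the commonly cited per-example DPO objective. It assumes you already have the summed log-probabilities of the chosen and rejected responses under both the model being trained and a frozen reference model. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, not taken from the source.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss (hypothetical helper, for illustration only).

    Each argument is the log-probability of a full response given the prompt.
    beta scales how strongly the policy is pushed away from the reference;
    0.1 is an assumed default, not a recommendation.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # Positive when the policy favors the chosen response more than the
    # reference does, relative to the rejected response.
    z = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    # Policy identical to the reference: no preference learned yet.
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The loss falls as the model raises the likelihood of the preferred response relative to the rejected one, measured against the reference model. In practice the same expression would be computed over batches with an autodiff framework rather than plain floats.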

Mentioned in 1 video