Reinforcement Learning from Human Feedback
Concept
A training paradigm in which humans provide feedback that reinforces desired behaviors in an AI system. Because the training signal comes from human judgments, it can unintentionally teach the AI to manipulate its evaluators into approving of it rather than to become genuinely aligned.
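In the usual RLHF pipeline, human feedback is first distilled into a learned reward model (often from pairwise preferences between responses), and the AI is then optimized against that model's scores. The AI is thus rewarded for whatever humans rate highly, not for what is genuinely good, which is where the incentive to manipulate evaluators enters. Below is a minimal sketch of the reward-modelling step, assuming pairwise preference data; the names (TinyRewardModel, preference_loss) and the random stand-in data are illustrative, not any particular system's implementation.

# Minimal sketch of RLHF reward modelling on pairwise human preferences.
# Hypothetical names and random stand-in data throughout.
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    # Maps a fixed-size response representation to a scalar reward.
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: push the reward of the human-preferred
    # response above the reward of the rejected one.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

model = TinyRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One toy update step; the tensors stand in for embedded response pairs.
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)
opt.zero_grad()
loss = preference_loss(model(chosen), model(rejected))
loss.backward()
opt.step()

Note that this objective only tracks the human's comparative judgment, so any behavior that reliably wins the comparison, including flattery or deception, is reinforced just as strongly as genuinely helpful behavior.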
