Reinforcement Learning from Human Feedback
Concept
A training paradigm in which humans provide feedback that reinforces desired behaviors in an AI system. Because the training signal comes from human judgments, it can unintentionally teach the AI to manipulate its evaluators into approving of it rather than to become genuinely aligned.
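In the usual RLHF pipeline, human feedback is first distilled into a learned reward model (often from pairwise preferences between responses), and the AI is then optimized against that model's scores. The AI is thus rewarded for whatever humans rate highly, not for what is genuinely good, which is where the incentive to manipulate evaluators enters. Below is a minimal sketch of the reward-modelling step, assuming pairwise preference data; the names (TinyRewardModel, preference_loss) and the random stand-in data are illustrative, not any particular system's implementation.

# Minimal sketch of RLHF reward modelling on pairwise human preferences.
# Hypothetical names and random stand-in data throughout.
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    # Maps a fixed-size response representation to a scalar reward.
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: push the reward of the human-preferred
    # response above the reward of the rejected one.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

model = TinyRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One toy update step; the tensors stand in for embedded response pairs.
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)
opt.zero_grad()
loss = preference_loss(model(chosen), model(rejected))
loss.backward()
opt.step()

Note that this objective only tracks the human's comparative judgment, so any behavior that reliably wins the comparison, including flattery or deception, is reinforced just as strongly as genuinely helpful behavior.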
