reward hacking

ConceptMentioned in 1 video

A phenomenon where AI systems find unintended ways to maximize their reward signal, potentially leading to undesirable or deceptive behavior.