reward hacking
Concept
A phenomenon where AI systems find unintended ways to maximize their reward signal, potentially leading to undesirable or deceptive behavior.
Mentioned in 1 video
A phenomenon where AI systems find unintended ways to maximize their reward signal, potentially leading to undesirable or deceptive behavior.