reward hacking

Concept

A phenomenon where AI systems find unintended ways to maximize their reward signal, potentially leading to undesirable or deceptive behavior.

Mentioned in 1 video