Alignment Faking in Large Language Models

Study / Research

A paper investigating whether current large language models, when placed in a training setup whose objectives conflict with their existing preferences, will fake alignment during training to avoid having their goals modified.

Mentioned in 1 video