Alignment Faking in Large Language Models

Study / Research

A paper investigating whether current large language models, when placed in a training setup whose objectives conflict with their existing preferences, will fake alignment during training to avoid having their goals modified.

Mentioned in 1 video