Alignment Faking in Large Language Models

Study / Research

A paper investigating whether current large language models, when told they are being trained toward objectives that conflict with their existing preferences, will selectively comply during training — faking alignment — in order to avoid having their goals modified.