Alignment Faking in Large Language Models
Study / Research
A paper investigating whether current large language models, when placed in a training setup whose objectives conflict with their existing preferences, will fake alignment during training to avoid having their goals modified.
