8 May 2024 • Joshua Clymer, Caden Juang, Severin Field
Like a criminal under investigation, Large Language Models (LLMs) might pretend to be aligned while being evaluated and misbehave when given a good opportunity.