Search Results for author: Severin Field

Found 1 paper, 0 papers with code

Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals

no code implementations • 8 May 2024 • Joshua Clymer, Caden Juang, Severin Field

Like a criminal under investigation, Large Language Models (LLMs) might pretend to be aligned while being evaluated and misbehave when they have a good opportunity.
