no code implementations • 8 May 2024 • Joshua Clymer, Caden Juang, Severin Field
Like a criminal under investigation, Large Language Models (LLMs) might pretend to be aligned while they are being evaluated and misbehave when they see a good opportunity.
no code implementations • 15 Mar 2024 • Joshua Clymer, Nick Gabrieli, David Krueger, Thomas Larsen
To prepare for decisions about whether it is safe to train and deploy advanced AI systems, we investigate how developers could make a 'safety case': a structured rationale that an AI system is unlikely to cause a catastrophe.
1 code implementation • 13 Nov 2023 • Joshua Clymer, Garrett Baker, Rohan Subramani, Sam Wang
As AI systems become more intelligent and their behavior becomes more challenging to assess, they may learn to game the flaws of human feedback instead of genuinely striving to follow instructions. This risk can be mitigated by controlling how LLMs generalize from human feedback to situations where it is unreliable.