Search Results for author: Joshua Clymer

Found 3 papers, 1 paper with code

Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals

no code implementations · 8 May 2024 · Joshua Clymer, Caden Juang, Severin Field

Like a criminal under investigation, Large Language Models (LLMs) might pretend to be aligned while evaluated and misbehave when they have a good opportunity.

Safety Cases: How to Justify the Safety of Advanced AI Systems

no code implementations · 15 Mar 2024 · Joshua Clymer, Nick Gabrieli, David Krueger, Thomas Larsen

To prepare for these decisions, we investigate how developers could make a 'safety case,' which is a structured rationale that AI systems are unlikely to cause a catastrophe.

Generalization Analogies: A Testbed for Generalizing AI Oversight to Hard-To-Measure Domains

1 code implementation · 13 Nov 2023 · Joshua Clymer, Garrett Baker, Rohan Subramani, Sam Wang

As AI systems become more intelligent and their behavior becomes more challenging to assess, they may learn to game the flaws of human feedback instead of genuinely striving to follow instructions; however, this risk can be mitigated by controlling how LLMs generalize human feedback to situations where it is unreliable.

Instruction Following
