1 code implementation • 23 Feb 2024 • Cody Rushing, Neel Nanda
Prior interpretability research studying narrow distributions has preliminarily identified self-repair, a phenomena where if components in large language models are ablated, later components will change their behavior to compensate.
1 code implementation • 6 Oct 2023 • Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda
We show that self-repair is implemented by several mechanisms, one of which is copy suppression, which explains 39% of the behavior in a narrow task.