1 code implementation • 28 Mar 2024 • Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, Aaron Mueller
We introduce methods for discovering and applying sparse feature circuits.
1 code implementation • 5 Dec 2023 • Michael Igorevich Ivanitskiy, Alex F. Spies, Tilman Räuker, Guillaume Corlouer, Chris Mathwin, Lucia Quirke, Can Rager, Rusheb Shah, Dan Valentine, Cecilia Diniz Behn, Katsumi Inoue, Samy Wu Fung
Transformer models underpin many recent advances in practical machine learning applications, yet understanding their internal behavior continues to elude researchers.
1 code implementation • 16 Oct 2023 • Aaquib Syed, Can Rager, Arthur Conmy
Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models.
no code implementations • 11 Oct 2023 • James Dao, Yeu-Tong Lau, Can Rager, Jett Janiak
That is, clearing residual stream directions set by earlier layers by reading in information and writing out the negative version.
1 code implementation • 19 Sep 2023 • Michael Igorevich Ivanitskiy, Rusheb Shah, Alex F. Spies, Tilman Räuker, Dan Valentine, Can Rager, Lucia Quirke, Chris Mathwin, Guillaume Corlouer, Cecilia Diniz Behn, Samy Wu Fung
Understanding how machine learning models respond to distributional shifts is a key research challenge.