Search Results for author: Can Rager

Found 5 papers, 4 papers with code

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

1 code implementation • 28 Mar 2024 • Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, Aaron Mueller

We introduce methods for discovering and applying sparse feature circuits.

Language Modelling

Structured World Representations in Maze-Solving Transformers

1 code implementation • 5 Dec 2023 • Michael Igorevich Ivanitskiy, Alex F. Spies, Tilman Räuker, Guillaume Corlouer, Chris Mathwin, Lucia Quirke, Can Rager, Rusheb Shah, Dan Valentine, Cecilia Diniz Behn, Katsumi Inoue, Samy Wu Fung

Transformer models underpin many recent advances in practical machine learning applications, yet understanding their internal behavior continues to elude researchers.

Attribution Patching Outperforms Automated Circuit Discovery

1 code implementation • 16 Oct 2023 • Aaquib Syed, Can Rager, Arthur Conmy

Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models.

An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l

no code implementations • 11 Oct 2023 • James Dao, Yeu-Tong Lau, Can Rager, Jett Janiak

Memory management here means clearing residual stream directions set by earlier layers: a component reads in that information and writes out its negation.
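
The clearing step described above can be illustrated with a toy sketch. This is not the paper's code and does not use the gelu-4l model; the vectors, the `erase_direction` helper, and the three-dimensional "residual stream" are all hypothetical, chosen only to show what reading a direction and writing out its negative looks like when the residual stream is additive.

```python
# Toy illustration only: hypothetical vectors, not taken from the paper or from gelu-4l.

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def erase_direction(residual, direction):
    """Return the update a hypothetical cleanup component would write.

    It reads how much of `direction` is present in `residual`
    and writes out the negative of that component.
    """
    coeff = dot(residual, direction) / dot(direction, direction)  # read
    return [-coeff * d for d in direction]                        # write the negative

residual = [3.0, 1.0, -2.0]   # assumed residual stream after an earlier layer
direction = [1.0, 0.0, 0.0]   # assumed direction that the earlier layer wrote to

update = erase_direction(residual, direction)
# Component outputs are added to the residual stream, so the direction cancels.
cleared = [r + u for r, u in zip(residual, update)]
print(cleared)  # [0.0, 1.0, -2.0]
```

In this sketch the later component's output points opposite to the earlier write, so looking at its contribution in isolation would say little about what the model finally uses.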

A Configurable Library for Generating and Manipulating Maze Datasets

1 code implementation • 19 Sep 2023 • Michael Igorevich Ivanitskiy, Rusheb Shah, Alex F. Spies, Tilman Räuker, Dan Valentine, Can Rager, Lucia Quirke, Chris Mathwin, Guillaume Corlouer, Cecilia Diniz Behn, Samy Wu Fung

Understanding how machine learning models respond to distributional shifts is a key research challenge.
