Search Results for author: Eric J. Michaud

Found 11 papers, 9 with code

Survival of the Fittest Representation: A Case Study with Modular Addition

no code implementations · 27 May 2024 · Xiaoman Delores Ding, Zifan Carl Guo, Eric J. Michaud, Ziming Liu, Max Tegmark

To investigate this Survival of the Fittest hypothesis, we conduct a case study on neural networks performing modular addition, and find that these networks' multiple circular representations at different Fourier frequencies undergo such competitive dynamics, with only a few circles surviving at the end.
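
The competition can be made concrete by measuring which Fourier frequencies carry power in an embedding matrix. Below is a minimal sketch with a synthetic embedding standing in for a trained one (the surviving frequencies 3 and 17 are assumptions for illustration; the paper measures this on networks actually trained on modular addition).

```python
import numpy as np

p, d = 59, 128                       # modulus and embedding width (toy values)
rng = np.random.default_rng(0)
tokens = np.arange(p)

# Synthetic stand-in for a trained embedding: two surviving "circles" at
# frequencies 3 and 17, plus noise. In the paper this matrix comes from training.
E = rng.normal(scale=0.05, size=(p, d))
for k, (c0, c1) in [(3, (0, 1)), (17, (2, 3))]:
    E[:, c0] += np.cos(2 * np.pi * k * tokens / p)
    E[:, c1] += np.sin(2 * np.pi * k * tokens / p)

# Fourier power per frequency: squared magnitude of the DFT of each column.
F = np.fft.rfft(E, axis=0)                   # shape (p // 2 + 1, d)
power = (np.abs(F) ** 2).sum(axis=1)         # total power at each frequency
power[0] = 0.0                               # ignore the DC component
print("dominant frequencies:", np.argsort(power)[::-1][:4])  # expect 3 and 17
```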

Not All Language Model Features Are Linear

1 code implementation · 23 May 2024 · Joshua Engels, Isaac Liao, Eric J. Michaud, Wes Gurnee, Max Tegmark

Recent work has proposed the linear representation hypothesis: that language models perform computation by manipulating one-dimensional representations of concepts ("features") in activation space.

Language Modelling
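
As a quick illustration of the hypothesis under test (a sketch, not the paper's code): a linear feature is read off by projecting activations onto one direction, whereas the paper's counterexamples, such as circular features for days of the week, need a multi-dimensional subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
v = rng.normal(size=d)
v /= np.linalg.norm(v)                 # hypothetical one-dimensional feature direction

acts = rng.normal(size=(1000, d))      # stand-in activations
strength = acts @ v                    # linear feature: one scalar per example

# A multi-dimensional feature in the paper's sense: days of the week laid out
# on a circle, requiring a 2-D (cos, sin) subspace rather than one direction.
theta = 2 * np.pi * np.arange(7) / 7
week_circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
print(strength.shape, week_circle.shape)   # (1000,) (7, 2)
```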

Opening the AI black box: program synthesis via mechanistic interpretability

1 code implementation · 7 Feb 2024 · Eric J. Michaud, Isaac Liao, Vedang Lad, Ziming Liu, Anish Mudide, Chloe Loughridge, Zifan Carl Guo, Tara Rezaei Kheirkhah, Mateja Vukelić, Max Tegmark

We present MIPS, a novel method for program synthesis based on automated mechanistic interpretability of neural networks trained to perform the desired task, auto-distilling the learned algorithm into Python code.

Program Synthesis · Symbolic Regression
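
A toy stand-in for the distillation step (illustrative only; MIPS itself recovers the algorithm by analyzing the network's internal mechanism rather than by tabulating it): treat a small trained network as a black box over a finite input space and emit equivalent Python source.

```python
from itertools import product

def trained_net(a: int, b: int) -> int:
    """Hypothetical stand-in for a trained network (here it computes XOR)."""
    return a ^ b

# Tabulate the learned function over its (small) input space.
table = {(a, b): trained_net(a, b) for a, b in product((0, 1), repeat=2)}

# Emit Python source that reproduces the behavior exactly.
src = "def distilled(a, b):\n    return {\n"
src += "".join(f"        ({a}, {b}): {out},\n" for (a, b), out in table.items())
src += "    }[(a, b)]\n"
exec(src)  # defines distilled(...) in the current namespace

assert all(distilled(a, b) == trained_net(a, b) for (a, b) in table)
print(src)
```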

The Quantization Model of Neural Scaling

1 code implementation · NeurIPS 2023 · Eric J. Michaud, Ziming Liu, Uzay Girit, Max Tegmark

We tentatively find that the frequency at which these quanta are used in the training distribution roughly follows a power law whose exponent matches the empirical scaling exponent for language models, as our theory predicts.

Language Modelling · Quantization
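
The scaling argument can be sanity-checked numerically. A minimal sketch with toy numbers (not the paper's fits): if quanta are used with Zipfian frequency p_k ∝ k^-(α+1) and a model has learned the first n quanta, the residual loss from the unlearned tail scales as n^-α.

```python
import numpy as np

alpha = 0.5
K = 1_000_000                               # toy cutoff on the number of quanta
k = np.arange(1, K + 1)
p = k ** -(alpha + 1.0)
p /= p.sum()                                # Zipf-like usage frequencies

ns = np.array([10, 100, 1000, 10000])
loss = np.array([p[n:].sum() for n in ns])  # loss from not-yet-learned quanta

# Fit the scaling exponent from the log-log slope; expect roughly -alpha.
slope = np.polyfit(np.log(ns), np.log(loss), 1)[0]
print(f"fitted scaling exponent: {slope:.3f} (theory predicts {-alpha})")
```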

Precision Machine Learning

1 code implementation · 24 Oct 2022 · Eric J. Michaud, Ziming Liu, Max Tegmark

We explore unique considerations involved in fitting ML models to data with very high precision, as is often required for science applications.
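
One such consideration, in a minimal sketch (a Chebyshev polynomial fit as a hypothetical stand-in for an ML model): float32 data puts a floor on achievable fit error several orders of magnitude above what float64 allows.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

x = np.linspace(-1, 1, 200)
y = np.sin(np.pi * x)                        # target function, exact in float64

for dtype in (np.float32, np.float64):
    xd, yd = x.astype(dtype), y.astype(dtype)
    coeffs = C.chebfit(xd, yd, deg=15)       # fit the stand-in "model"
    rmse = np.sqrt(np.mean((C.chebval(xd, coeffs) - yd) ** 2))
    # float32 rounding of the data caps the RMSE near 1e-8; float64 reaches
    # the truncation error of the degree-15 fit, around 1e-11.
    print(f"{np.dtype(dtype).name}: RMSE = {rmse:.1e}")
```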

Omnigrok: Grokking Beyond Algorithmic Data

1 code implementation · 3 Oct 2022 · Ziming Liu, Eric J. Michaud, Max Tegmark

Grokking, the unusual phenomenon for algorithmic datasets where generalization happens long after overfitting the training data, has remained elusive.

Attribute · Representation Learning
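
A compressed training sketch of the phenomenon (assumed setup: a small MLP with AdamW and strong weight decay, half the modular-addition table held out; real grokking runs often need far more steps, so the step counts are illustrative).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 23
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
train, test = perm[: len(perm) // 2], perm[len(perm) // 2 :]

model = nn.Sequential(nn.Embedding(p, 32), nn.Flatten(), nn.Linear(64, 128),
                      nn.ReLU(), nn.Linear(128, p))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def accuracy(idx):
    with torch.no_grad():
        return (model(pairs[idx]).argmax(-1) == labels[idx]).float().mean().item()

for step in range(5001):
    opt.zero_grad()
    loss = loss_fn(model(pairs[train]), labels[train])
    loss.backward()
    opt.step()
    if step % 500 == 0:
        # Grokking signature: train accuracy saturates long before test accuracy.
        print(f"step {step:5d}  train={accuracy(train):.2f}  test={accuracy(test):.2f}")
```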

Understanding Learned Reward Functions

1 code implementation · 10 Dec 2020 · Eric J. Michaud, Adam Gleave, Stuart Russell

Current techniques for reward learning may fail to produce reward functions that accurately reflect user preferences.
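
In the paper's spirit, a hypothetical minimal probe (not the authors' code): check which observation features a learned reward is actually sensitive to, here via input gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
obs_dim = 8
reward_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
# (In practice reward_net would be learned from demonstrations or preferences.)

obs = torch.randn(1, obs_dim, requires_grad=True)
reward_net(obs).sum().backward()             # d(reward) / d(observation)
saliency = obs.grad.abs().squeeze()
print("per-feature sensitivity:", saliency)
# Strong sensitivity to task-irrelevant features would suggest the learned
# reward does not actually reflect the intended preferences.
```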

Examining the causal structures of deep neural networks using information theory

1 code implementation · 26 Oct 2020 · Simon Mattsson, Eric J. Michaud, Erik Hoel

We introduce the effective information (EI) of a feedforward DNN, which is the mutual information between layer input and output following a maximum-entropy perturbation.
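
A minimal sketch of the estimator for one input unit (assumed binned mutual-information estimate and a toy layer; the paper's procedure is more careful): feed maximum-entropy uniform inputs through the layer and estimate the information shared between input and output.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(1, 4))                   # toy 4 -> 1 "layer"

x = rng.uniform(0.0, 1.0, size=(100_000, 4))  # maximum-entropy input perturbation
y = np.tanh(x @ W.T)                          # layer output

# Binned estimate of I(x_0; y): discretize both variables and sum
# p(x, y) * log2( p(x, y) / (p(x) p(y)) ) over the joint histogram.
bins = 16
joint, _, _ = np.histogram2d(x[:, 0], y[:, 0], bins=bins)
pxy = joint / joint.sum()
px = pxy.sum(axis=1, keepdims=True)
py = pxy.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore", invalid="ignore"):
    mi = np.nansum(pxy * np.log2(pxy / (px * py)))
print(f"estimated mutual information: {mi:.3f} bits")
```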
