1 code implementation • 28 Mar 2024 • Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, Aaron Mueller
We introduce methods for discovering and applying sparse feature circuits.
1 code implementation • 7 Feb 2024 • Eric J. Michaud, Isaac Liao, Vedang Lad, Ziming Liu, Anish Mudide, Chloe Loughridge, Zifan Carl Guo, Tara Rezaei Kheirkhah, Mateja Vukelić, Max Tegmark
We present MIPS, a novel method for program synthesis based on automated mechanistic interpretability of neural networks trained to perform the desired task, auto-distilling the learned algorithm into Python code.
no code implementations • 27 Jul 2023 • Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Biyik, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals.
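The reward-modeling step at the heart of RLHF can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's method: it fits a linear reward function to synthetic preference pairs using the Bradley-Terry model, where P(a preferred over b) = sigmoid(r(a) - r(b)).

```python
import numpy as np

# Hypothetical illustration of RLHF's reward-modeling step (not the paper's
# code): fit a reward function to preference pairs via the Bradley-Terry model.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])  # hypothetical "true" human preferences

# Synthetic data: pairs of 3-dim trajectory features, labeled by which member
# the (simulated) human prefers under the true reward.
A = rng.normal(size=(500, 3))
B = rng.normal(size=(500, 3))
prefer_a = (A @ true_w) > (B @ true_w)

w = np.zeros(3)  # learned linear reward weights
lr = 0.5
for _ in range(200):
    diff = np.where(prefer_a[:, None], A - B, B - A)    # preferred minus rejected
    p = 1.0 / (1.0 + np.exp(-diff @ w))                 # Bradley-Terry probability
    w += lr * ((1.0 - p)[:, None] * diff).mean(axis=0)  # ascent on log-likelihood

# The learned reward should rank pairs like the true one (up to scale).
agree = (((A - B) @ w > 0) == prefer_a).mean()
print(f"pairwise agreement: {agree:.2f}")
```

In a full RLHF pipeline this learned reward would then supervise a policy-optimization step; the sketch stops at reward fitting, which is where the mismatch between learned rewards and user preferences can arise.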
1 code implementation • NeurIPS 2023 • Eric J. Michaud, Ziming Liu, Uzay Girit, Max Tegmark
We tentatively find that the frequency with which these quanta are used in the training distribution roughly follows a power law whose exponent matches the empirical scaling exponent for language models, as our theory predicts.
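The scaling picture above can be sketched numerically. In this toy (an illustration under stated assumptions, not the paper's code), quanta ranked k = 1..N are used with power-law frequency p_k ∝ k^-(alpha+1); a model that has learned the n most frequent quanta still errs on the rest, so its residual loss sum_{k>n} p_k should scale roughly as n^-alpha.

```python
import numpy as np

# Toy quantization-model scaling: power-law quanta usage implies a power-law
# loss curve in the number of learned quanta.
alpha = 0.5
N = 1_000_000
k = np.arange(1, N + 1)
p = k ** -(alpha + 1.0)
p /= p.sum()  # usage frequencies over quanta

ns = np.array([10, 100, 1000, 10000])       # numbers of learned quanta
tail = p[::-1].cumsum()[::-1]               # tail[n] = sum of p over k > n (0-indexed)
loss = tail[ns]                             # residual loss from unlearned quanta

# Fit the scaling exponent as the log-log slope of loss vs. n.
slope = np.polyfit(np.log(ns), np.log(loss), 1)[0]
print(f"fitted exponent: {slope:.3f} (theory predicts about {-alpha})")
```

The fitted slope comes out close to -alpha, recovering the predicted correspondence between the usage-frequency power law and the loss scaling exponent.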
1 code implementation • 24 Oct 2022 • Eric J. Michaud, Ziming Liu, Max Tegmark
We explore unique considerations involved in fitting ML models to data with very high precision, as is often required for scientific applications.
1 code implementation • 3 Oct 2022 • Ziming Liu, Eric J. Michaud, Max Tegmark
Grokking, the unusual phenomenon observed on algorithmic datasets in which generalization occurs long after the training data have been overfit, has remained elusive to explain.
1 code implementation • 20 May 2022 • Ziming Liu, Ouail Kitouni, Niklas Nolte, Eric J. Michaud, Max Tegmark, Mike Williams
We aim to understand grokking, a phenomenon where models generalize long after overfitting their training set.
1 code implementation • 10 Dec 2020 • Eric J. Michaud, Adam Gleave, Stuart Russell
However, current techniques for reward learning may fail to produce reward functions which accurately reflect user preferences.
1 code implementation • 26 Oct 2020 • Simon Mattsson, Eric J. Michaud, Erik Hoel
Specifically, we introduce the effective information (EI) of a feedforward DNN, which is the mutual information between layer input and output following a maximum-entropy perturbation.
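The effective-information definition above can be sketched in a minimal discrete setting. This is an illustrative toy under simplifying assumptions (a deterministic boolean "layer" with uniform binary inputs), not the paper's DNN formulation: for a deterministic map with maximum-entropy (uniform) input X, the mutual information I(X; Y) reduces to the output entropy H(Y).

```python
import itertools
import math

def effective_information(layer, n_bits):
    """EI in bits of a deterministic function over uniform n-bit inputs.

    With a deterministic mapping, H(Y|X) = 0, so I(X; Y) = H(Y) under the
    maximum-entropy (uniform) input distribution.
    """
    counts = {}
    inputs = list(itertools.product([0, 1], repeat=n_bits))
    for x in inputs:
        y = layer(x)
        counts[y] = counts.get(y, 0) + 1
    total = len(inputs)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A toy 2-bit "layer": logical AND. The output is 1 on 1/4 of inputs and 0 on
# 3/4, so EI = H(1/4, 3/4) ~ 0.811 bits.
ei = effective_information(lambda x: x[0] & x[1], n_bits=2)
print(f"EI of AND gate: {ei:.3f} bits")
```

An identity layer on 2 bits would instead give EI = 2 bits, the maximum: EI measures how much of the (maximum-entropy) input variation survives the layer's mapping.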