Search Results for author: Tom Lieberum

Found 6 papers, 1 paper with code

Improving Dictionary Learning with Gated Sparse Autoencoders

no code implementations • 24 Apr 2024 • Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations.

Dictionary Learning


Evaluating Frontier Models for Dangerous Capabilities

no code implementations • 20 Mar 2024 • Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson, Heidi Howard, Tom Lieberum, Ramana Kumar, Maria Abi Raad, Albert Webson, Lewis Ho, Sharon Lin, Sebastian Farquhar, Marcus Hutter, Gregoire Deletang, Anian Ruoss, Seliem El-Sayed, Sasha Brown, Anca Dragan, Rohin Shah, Allan Dafoe, Toby Shevlane

To understand the risks posed by a new AI system, we must understand what it can and cannot do.


AtP*: An efficient and scalable method for localizing LLM behaviour to components

no code implementations • 1 Mar 2024 • János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda

We investigate Attribution Patching (AtP), a fast gradient-based approximation to Activation Patching, and find two classes of failure modes of AtP which lead to significant false negatives.


Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

no code implementations • 18 Jul 2023 • Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, Vladimir Mikulik

Circuit analysis is a promising technique for understanding the internal mechanisms of language models.

Multiple-choice Question Answering


Progress measures for grokking via mechanistic interpretability

1 code implementation • 12 Jan 2023 • Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt

Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup.

Memorization


Retrospective on the 2021 BASALT Competition on Learning from Human Feedback

no code implementations • 14 Apr 2022 • Rohin Shah, Steven H. Wang, Cody Wild, Stephanie Milani, Anssi Kanervisto, Vinicius G. Goecks, Nicholas Waytowich, David Watkins-Valls, Bharat Prakash, Edmund Mills, Divyansh Garg, Alexander Fries, Alexandra Souly, Chan Jun Shern, Daniel del Castillo, Tom Lieberum

The goal of the competition was to promote research towards agents that use learning from human feedback (LfHF) techniques to solve open-world tasks.

