Search Results for author: Tom Lieberum

Found 6 papers, 1 paper with code

Improving Dictionary Learning with Gated Sparse Autoencoders

no code implementations • 24 Apr 2024 • Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations.

Dictionary Learning

AtP*: An efficient and scalable method for localizing LLM behaviour to components

no code implementations • 1 Mar 2024 • János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda

We investigate Attribution Patching (AtP), a fast gradient-based approximation to Activation Patching, and find two classes of failure modes of AtP which lead to significant false negatives.

Progress measures for grokking via mechanistic interpretability

1 code implementation • 12 Jan 2023 • Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt

Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup.

Memorization
