Search Results for author: Lee Sharkey

Found 7 papers, 3 papers with code

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

1 code implementation • 17 May 2024 • Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey

We propose end-to-end (e2e) sparse dictionary learning, a method for training SAEs that ensures the features learned are functionally important by minimizing the KL divergence between the output distributions of the original model and the model with SAE activations inserted.

Dictionary Learning
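
The abstract above describes training SAEs against the downstream KL divergence rather than only a reconstruction loss. Below is a minimal PyTorch-style sketch of that idea, not the paper's implementation: `run_and_cache` and `run_with_patched_acts` are hypothetical helpers standing in for whatever activation hooks the model provides, and all dimensions and coefficients are illustrative.

```python
# Sketch of an end-to-end (e2e) SAE objective: the SAE reconstruction replaces the
# original activations, and the loss is the KL divergence between the original
# model's output distribution and the SAE-patched model's output distribution,
# plus a sparsity penalty on the SAE features.

import torch
import torch.nn.functional as F
from torch import nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts: torch.Tensor):
        features = F.relu(self.encoder(acts))   # sparse feature activations
        recon = self.decoder(features)          # reconstructed activations
        return recon, features


def e2e_loss(model, sae, tokens, layer: int, sparsity_coeff: float = 1e-3):
    """KL(original outputs || outputs with SAE reconstruction patched in) + L1 sparsity."""
    with torch.no_grad():
        # Hypothetical helper: run the model, return logits and the layer's activations.
        orig_logits, acts = run_and_cache(model, tokens, layer)

    recon, features = sae(acts)
    # Hypothetical helper: rerun the model with the reconstruction patched in at `layer`.
    patched_logits = run_with_patched_acts(model, tokens, layer, recon)

    kl = F.kl_div(
        F.log_softmax(patched_logits, dim=-1),
        F.log_softmax(orig_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    sparsity = features.abs().sum(dim=-1).mean()
    return kl + sparsity_coeff * sparsity
```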

Sparse Autoencoders Find Highly Interpretable Features in Language Models

2 code implementations • 15 Sep 2023 • Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey

One hypothesised cause of polysemanticity is superposition, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons.

counterfactual Language Modelling +1
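
For contrast with the e2e objective above, here is a hedged sketch of the standard sparse dictionary learning setup the abstract alludes to: an overcomplete dictionary (more directions than neurons) trained with a reconstruction loss plus an L1 sparsity penalty. Dimensions and coefficients are illustrative, not the paper's settings.

```python
# Standard SAE objective on cached activations: reconstruct each activation vector
# as a sparse combination of an overcomplete set of learned directions
# (the columns of the decoder weight matrix).

import torch
import torch.nn.functional as F
from torch import nn

d_model, d_dict = 512, 4096                       # overcomplete: more directions than neurons
encoder = nn.Linear(d_model, d_dict)
decoder = nn.Linear(d_dict, d_model, bias=False)  # decoder columns = dictionary directions


def sae_loss(acts: torch.Tensor, l1_coeff: float = 1e-3) -> torch.Tensor:
    features = F.relu(encoder(acts))              # sparse codes
    recon = decoder(features)                     # reconstruction from dictionary directions
    mse = F.mse_loss(recon, acts)
    l1 = features.abs().sum(dim=-1).mean()        # sparsity penalty
    return mse + l1_coeff * l1
```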

A technical note on bilinear layers for interpretability

no code implementations • 5 May 2023 • Lee Sharkey

The ability of neural networks to represent more features than neurons makes interpreting them challenging.
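
The note concerns bilinear layers; one common formulation, assumed here and not necessarily the exact variant the note analyses, replaces the MLP nonlinearity with an element-wise product of two linear projections, making the output a quadratic function of the input. Shapes below are illustrative.

```python
# Minimal sketch of a bilinear layer in the gated-MLP sense:
# out = W_out((W1 x) * (W2 x)). The element-wise product of two linear
# projections removes the need for a ReLU/GELU nonlinearity.

import torch
from torch import nn


class BilinearLayer(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)
        self.w2 = nn.Linear(d_model, d_hidden, bias=False)
        self.w_out = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise product of two linear projections of the same input.
        return self.w_out(self.w1(x) * self.w2(x))
```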

Circumventing interpretability: How to defeat mind-readers

no code implementations • 21 Dec 2022 • Lee Sharkey

The increasing capabilities of artificial intelligence (AI) systems make it ever more important that we interpret their internals to ensure that their intentions are aligned with human values.

Interpreting Neural Networks through the Polytope Lens

no code implementations • 22 Nov 2022 • Sid Black, Lee Sharkey, Leo Grinsztajn, Eric Winsor, Dan Braun, Jacob Merizian, Kip Parker, Carlos Ramón Guevara, Beren Millidge, Gabriel Alfour, Connor Leahy

Previous mechanistic descriptions have used individual neurons or their linear combinations to understand the representations a network has learned.
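
Where earlier descriptions focus on individual neurons or directions, the polytope lens looks at the regions on which a ReLU network is piecewise linear. A hedged sketch of the basic object involved, not the paper's code: the binary on/off pattern of the ReLUs, which identifies the polytope an input falls into. The MLP and helper below are illustrative.

```python
# Each input to a ReLU network falls into a region (polytope) defined by the
# on/off pattern of every ReLU; within a region, the network acts linearly.

import torch
from torch import nn

mlp = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 2),
)


def activation_pattern(model: nn.Sequential, x: torch.Tensor) -> torch.Tensor:
    """Concatenate the binary on/off pattern of every ReLU for a batch of inputs."""
    pattern = []
    h = x
    for layer in model:
        h = layer(h)
        if isinstance(layer, nn.ReLU):
            pattern.append((h > 0).flatten(start_dim=1))
    # Inputs with the same pattern lie in the same polytope.
    return torch.cat(pattern, dim=1)


x = torch.randn(4, 10)
codes = activation_pattern(mlp, x)   # shape: (4, 64) binary codes
```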

Goal Misgeneralization in Deep Reinforcement Learning

4 code implementations • 28 May 2021 • Lauro Langosco, Jack Koch, Lee Sharkey, Jacob Pfau, Laurent Orseau, David Krueger

We study goal misgeneralization, a type of out-of-distribution generalization failure in reinforcement learning (RL).

Navigate Out-of-Distribution Generalization +2
