Search Results for author: Lee Sharkey

Found 7 papers, 3 papers with code

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

1 code implementation • 17 May 2024 • Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey

We propose end-to-end (e2e) sparse dictionary learning, a method for training SAEs that ensures the features learned are functionally important by minimizing the KL divergence between the output distributions of the original model and the model with SAE activations inserted.

Dictionary Learning
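
The abstract above describes training SAEs against the downstream KL divergence rather than only a reconstruction loss. Below is a minimal PyTorch-style sketch of that idea, not the paper's implementation: `run_and_cache` and `run_with_patched_acts` are hypothetical helpers standing in for whatever activation hooks the model provides, and all dimensions and coefficients are illustrative.

```python
# Sketch of an end-to-end (e2e) SAE objective: the SAE reconstruction replaces the
# original activations, and the loss is the KL divergence between the original
# model's output distribution and the SAE-patched model's output distribution,
# plus a sparsity penalty on the SAE features.

import torch
import torch.nn.functional as F
from torch import nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts: torch.Tensor):
        features = F.relu(self.encoder(acts))   # sparse feature activations
        recon = self.decoder(features)          # reconstructed activations
        return recon, features


def e2e_loss(model, sae, tokens, layer: int, sparsity_coeff: float = 1e-3):
    """KL(original outputs || outputs with SAE reconstruction patched in) + L1 sparsity."""
    with torch.no_grad():
        # Hypothetical helper: run the model, return logits and the layer's activations.
        orig_logits, acts = run_and_cache(model, tokens, layer)

    recon, features = sae(acts)
    # Hypothetical helper: rerun the model with the reconstruction patched in at `layer`.
    patched_logits = run_with_patched_acts(model, tokens, layer, recon)

    kl = F.kl_div(
        F.log_softmax(patched_logits, dim=-1),
        F.log_softmax(orig_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    sparsity = features.abs().sum(dim=-1).mean()
    return kl + sparsity_coeff * sparsity
```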

Sparse Autoencoders Find Highly Interpretable Features in Language Models

2 code implementations • 15 Sep 2023 • Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey

One hypothesised cause of polysemanticity is superposition, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons.

counterfactual Language Modelling +1
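
For contrast with the e2e objective above, here is a hedged sketch of the standard sparse dictionary learning setup the abstract alludes to: an overcomplete dictionary (more directions than neurons) trained with a reconstruction loss plus an L1 sparsity penalty. Dimensions and coefficients are illustrative, not the paper's settings.

```python
# Standard SAE objective on cached activations: reconstruct each activation vector
# as a sparse combination of an overcomplete set of learned directions
# (the columns of the decoder weight matrix).

import torch
import torch.nn.functional as F
from torch import nn

d_model, d_dict = 512, 4096                       # overcomplete: more directions than neurons
encoder = nn.Linear(d_model, d_dict)
decoder = nn.Linear(d_dict, d_model, bias=False)  # decoder columns = dictionary directions


def sae_loss(acts: torch.Tensor, l1_coeff: float = 1e-3) -> torch.Tensor:
    features = F.relu(encoder(acts))              # sparse codes
    recon = decoder(features)                     # reconstruction from dictionary directions
    mse = F.mse_loss(recon, acts)
    l1 = features.abs().sum(dim=-1).mean()        # sparsity penalty
    return mse + l1_coeff * l1
```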

A technical note on bilinear layers for interpretability

no code implementations • 5 May 2023 • Lee Sharkey

The ability of neural networks to represent more features than neurons makes interpreting them challenging.
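
The note concerns bilinear layers; one common formulation, assumed here and not necessarily the exact variant the note analyses, replaces the MLP nonlinearity with an element-wise product of two linear projections, making the output a quadratic function of the input. Shapes below are illustrative.

```python
# Minimal sketch of a bilinear layer in the gated-MLP sense:
# out = W_out((W1 x) * (W2 x)). The element-wise product of two linear
# projections removes the need for a ReLU/GELU nonlinearity.

import torch
from torch import nn


class BilinearLayer(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)
        self.w2 = nn.Linear(d_model, d_hidden, bias=False)
        self.w_out = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise product of two linear projections of the same input.
        return self.w_out(self.w1(x) * self.w2(x))
```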

Circumventing interpretability: How to defeat mind-readers

no code implementations • 21 Dec 2022 • Lee Sharkey

The increasing capabilities of artificial intelligence (AI) systems make it ever more important that we interpret their internals to ensure that their intentions are aligned with human values.

Interpreting Neural Networks through the Polytope Lens

no code implementations • 22 Nov 2022 • Sid Black, Lee Sharkey, Leo Grinsztajn, Eric Winsor, Dan Braun, Jacob Merizian, Kip Parker, Carlos Ramón Guevara, Beren Millidge, Gabriel Alfour, Connor Leahy

Previous mechanistic descriptions have used individual neurons or their linear combinations to understand the representations a network has learned.
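
Where earlier descriptions focus on individual neurons or directions, the polytope lens looks at the regions on which a ReLU network is piecewise linear. A hedged sketch of the basic object involved, not the paper's code: the binary on/off pattern of the ReLUs, which identifies the polytope an input falls into. The MLP and helper below are illustrative.

```python
# Each input to a ReLU network falls into a region (polytope) defined by the
# on/off pattern of every ReLU; within a region, the network acts linearly.

import torch
from torch import nn

mlp = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 2),
)


def activation_pattern(model: nn.Sequential, x: torch.Tensor) -> torch.Tensor:
    """Concatenate the binary on/off pattern of every ReLU for a batch of inputs."""
    pattern = []
    h = x
    for layer in model:
        h = layer(h)
        if isinstance(layer, nn.ReLU):
            pattern.append((h > 0).flatten(start_dim=1))
    # Inputs with the same pattern lie in the same polytope.
    return torch.cat(pattern, dim=1)


x = torch.randn(4, 10)
codes = activation_pattern(mlp, x)   # shape: (4, 64) binary codes
```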

Goal Misgeneralization in Deep Reinforcement Learning

4 code implementations • 28 May 2021 • Lauro Langosco, Jack Koch, Lee Sharkey, Jacob Pfau, Laurent Orseau, David Krueger

We study goal misgeneralization, a type of out-of-distribution generalization failure in reinforcement learning (RL).

Navigate Out-of-Distribution Generalization +2
