Search Results for author: Thomas McGrath

Found 4 papers, 2 papers with code

Copy Suppression: Comprehensively Understanding an Attention Head

1 code implementation · 6 Oct 2023 · Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda

We show that self-repair is implemented by several mechanisms, one of which is copy suppression, which explains 39% of the behavior in a narrow task.

Language Modelling

The Hydra Effect: Emergent Self-repair in Language Model Computations

no code implementations · 28 Jul 2023 · Thomas McGrath, Matthew Rahtz, Janos Kramar, Vladimir Mikulik, Shane Legg

We investigate the internal structure of language model computations using causal analysis and demonstrate two motifs: (1) a form of adaptive computation where ablations of one attention layer of a language model cause another layer to compensate (which we term the Hydra effect) and (2) a counterbalancing function of late MLP layers that act to downregulate the maximum-likelihood token.

Language Modelling

Tracr: Compiled Transformers as a Laboratory for Interpretability

1 code implementation · NeurIPS 2023 · David Lindner, János Kramár, Sebastian Farquhar, Matthew Rahtz, Thomas McGrath, Vladimir Mikulik

Additionally, the known structure of Tracr-compiled models can serve as ground-truth for evaluating interpretability methods.

Decoder

Acquisition of Chess Knowledge in AlphaZero

no code implementations · 17 Nov 2021 · Thomas McGrath, Andrei Kapishnikov, Nenad Tomašev, Adam Pearce, Demis Hassabis, Been Kim, Ulrich Paquet, Vladimir Kramnik

In this work we provide evidence that human knowledge is acquired by the AlphaZero neural network as it trains on the game of chess.

Game of Chess
