Search Results for author: Thomas McGrath

Found 4 papers, 2 papers with code

Copy Suppression: Comprehensively Understanding an Attention Head

1 code implementation • 6 Oct 2023 • Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda

We show that self-repair is implemented by several mechanisms, one of which is copy suppression, which explains 39% of the behavior in a narrow task.

Language Modelling

The Hydra Effect: Emergent Self-repair in Language Model Computations

no code implementations • 28 Jul 2023 • Thomas McGrath, Matthew Rahtz, Janos Kramar, Vladimir Mikulik, Shane Legg

We investigate the internal structure of language model computations using causal analysis and demonstrate two motifs: (1) a form of adaptive computation where ablations of one attention layer of a language model cause another layer to compensate (which we term the Hydra effect) and (2) a counterbalancing function of late MLP layers that act to downregulate the maximum-likelihood token.

Language Modelling