Search Results for author: János Kramár

Found 9 papers, 5 papers with code

AtP*: An efficient and scalable method for localizing LLM behaviour to components

no code implementations • 1 Mar 2024 • János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda

We investigate Attribution Patching (AtP), a fast gradient-based approximation to Activation Patching, and find two classes of failure modes of AtP which lead to significant false negatives.

Explaining grokking through circuit efficiency

no code implementations • 5 Sep 2023 • Vikrant Varma, Rohin Shah, Zachary Kenton, János Kramár, Ramana Kumar

One of the most surprising puzzles in neural network generalisation is grokking: a network with perfect training accuracy but poor generalisation will, upon further training, transition to perfect generalisation.

Tracr: Compiled Transformers as a Laboratory for Interpretability

1 code implementation • NeurIPS 2023 • David Lindner, János Kramár, Sebastian Farquhar, Matthew Rahtz, Thomas McGrath, Vladimir Mikulik

Additionally, the known structure of Tracr-compiled models can serve as ground-truth for evaluating interpretability methods.

Learning Reciprocity in Complex Sequential Social Dilemmas

no code implementations • 19 Mar 2019 • Tom Eccles, Edward Hughes, János Kramár, Steven Wheelwright, Joel Z. Leibo

We analyse the resulting policies to show that the reciprocating agents are strongly influenced by their co-players' behavior.
