Search Results for author: Janos Kramar

Found 5 papers, 0 papers with code

The Hydra Effect: Emergent Self-repair in Language Model Computations

no code implementations • 28 Jul 2023 • Thomas McGrath, Matthew Rahtz, Janos Kramar, Vladimir Mikulik, Shane Legg

We investigate the internal structure of language model computations using causal analysis and demonstrate two motifs: (1) a form of adaptive computation where ablations of one attention layer of a language model cause another layer to compensate (which we term the Hydra effect) and (2) a counterbalancing function of late MLP layers that act to downregulate the maximum-likelihood token.

Language Modelling

Power-seeking can be probable and predictive for trained agents

no code implementations • 13 Apr 2023 • Victoria Krakovna, Janos Kramar

Power-seeking behavior is a key source of risk from advanced AI, but our theoretical understanding of this phenomenon is relatively limited.

Neural Design of Contests and All-Pay Auctions using Multi-Agent Simulation

no code implementations • 25 Sep 2019 • Thomas Anthony, Ian Gemp, Janos Kramar, Tom Eccles, Andrea Tacchetti, Yoram Bachrach

In contrast to auctions designed manually by economists, our method searches the possible design space using a simulation of the multi-agent learning process, and can thus handle settings where a game-theoretic equilibrium analysis is not tractable.

Guidelines for Artificial Intelligence Containment

no code implementations • 24 Jul 2017 • James Babcock, Janos Kramar, Roman V. Yampolskiy

With almost daily improvements in capabilities of artificial intelligence it is more important than ever to develop safety software for use by the AI research community.

The AGI Containment Problem

no code implementations • 2 Apr 2016 • James Babcock, Janos Kramar, Roman Yampolskiy

There is considerable uncertainty about what properties, capabilities and motivations future AGIs will have.
