Search Results for author: James Dao

Found 2 papers, 0 papers with code

Position Paper: Toward New Frameworks for Studying Model Representations

no code implementations | 6 Feb 2024 | Satvik Golechha, James Dao

Mechanistic interpretability (MI) aims to understand AI models by reverse-engineering the exact algorithms neural networks learn.

An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l

no code implementations | 11 Oct 2023 | James Dao, Yeu-Tong Lau, Can Rager, Jett Janiak

Memory management here means clearing residual stream directions set by earlier layers by reading in that information and writing out its negative.
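
To make the "read in, write out the negative" idea concrete, here is a minimal toy sketch. It is not taken from the paper: the vectors, the unit direction, and the helper names `dot` and `erase_direction` are all hypothetical, and a real model would do this with learned weights rather than an explicit projection.

```python
# Toy illustration only; the vectors and helper names are hypothetical.

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def erase_direction(residual, direction):
    """Return the update that clears `direction` from `residual`.

    Assumes `direction` has unit length. The coefficient along the
    direction is "read in", and its negative is "written out".
    """
    coeff = dot(residual, direction)        # read in the stored information
    return [-coeff * d for d in direction]  # write out the negative version

# Assume an earlier layer wrote 3.0 along the unit direction (1, 0, 0).
direction = [1.0, 0.0, 0.0]
residual = [3.0, 0.5, -2.0]

update = erase_direction(residual, direction)
cleared = [r + u for r, u in zip(residual, update)]
```

After the update is added, the residual vector has no component left along `direction`, while the other coordinates are unchanged.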
