Search Results for author: Kshitij Sachan

Found 4 papers, 2 papers with code

AI Control: Improving Safety Despite Intentional Subversion

no code implementations · 12 Dec 2023 · Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger

One protocol, untrusted monitoring, asks GPT-4 to write code and then asks another instance of GPT-4 whether the code is backdoored, using various techniques to prevent the two GPT-4 instances from colluding.
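The control flow of such a protocol can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `writer` and `monitor` are hypothetical callables standing in for two instances of the same untrusted model, `strip_comments` is a naive comment-redaction step chosen here as one example of an anti-collusion technique, and `validate_monitor` is an assumed check that the monitor actually flags code already known to be backdoored (a colluding monitor that always answers "safe" would fail it). The suspicion `threshold` is likewise an illustrative parameter.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple

# Hypothetical stand-ins for two separate instances of one untrusted model.
Writer = Callable[[str], str]          # task -> candidate code
Monitor = Callable[[str, str], float]  # (task, code) -> suspicion in [0, 1]


@dataclass
class Decision:
    code: str
    suspicion: float
    audited: bool  # True means the code is escalated for further review


def strip_comments(code: str) -> str:
    """Naive redaction: drop everything after '#' on each line.

    Assumed anti-collusion step, so the writer cannot pass the monitor a
    hidden signal through comments. It also mangles '#' inside string
    literals; this is a sketch, not a real sanitizer.
    """
    return "\n".join(line.split("#", 1)[0].rstrip() for line in code.splitlines())


def validate_monitor(monitor: Monitor,
                     known_backdoored: Iterable[Tuple[str, str]],
                     threshold: float) -> bool:
    """Assumed check: the monitor must flag every example that is already
    known to contain a backdoor, otherwise it is not trusted."""
    return all(monitor(task, strip_comments(code)) >= threshold
               for task, code in known_backdoored)


def untrusted_monitoring(task: str, writer: Writer, monitor: Monitor,
                         threshold: float) -> Decision:
    code = writer(task)                              # instance 1 writes the code
    suspicion = monitor(task, strip_comments(code))  # instance 2 rates redacted code
    return Decision(code, suspicion, audited=suspicion >= threshold)
```

With a toy writer that hides a magic-number branch behind an innocent-looking comment and a toy monitor that looks for that branch, the comment is stripped before monitoring and the submission is escalated. A real deployment would replace both callables with model calls and tune the threshold to an auditing budget.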

Polysemanticity and Capacity in Neural Networks

no code implementations · 4 Oct 2022 · Adam Scherlis, Kshitij Sachan, Adam S. Jermyn, Joe Benton, Buck Shlegeris

We show that in a toy model the optimal capacity allocation tends to monosemantically represent the most important features, polysemantically represent less important features (in proportion to their impact on the loss), and entirely ignore the least important features.
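The three regimes (full, partial, and zero capacity) can be reproduced with a deliberately simple allocation problem. This sketch swaps in an assumed toy objective rather than the paper's model: each feature i has an importance `a_i`, its capacity `C_i` is restricted to [0, 1], the capacities share a fixed `total` budget, and the benefit of capacity is `a_i * C_i - curvature * C_i**2 / 2`. Under that assumption the optimum is `C_i = clip((a_i - lam) / curvature, 0, 1)` for a multiplier `lam` chosen so the budget is met, which the code finds by bisection. The names `allocate_capacity`, `curvature`, and `total` are hypothetical.

```python
from typing import List, Sequence


def allocate_capacity(importance: Sequence[float], total: float,
                      curvature: float = 1.0, iters: int = 100) -> List[float]:
    """Optimal capacities for an ASSUMED toy objective:
    maximize sum_i a_i*C_i - curvature*C_i**2/2
    subject to 0 <= C_i <= 1 and sum_i C_i <= total."""

    def caps(lam: float) -> List[float]:
        # KKT solution for a given multiplier lam, clipped to [0, 1].
        return [min(1.0, max(0.0, (a - lam) / curvature)) for a in importance]

    if sum(caps(0.0)) <= total:
        return caps(0.0)  # budget not binding: every feature gets what it wants

    # Total capacity is non-increasing in lam; bisect so the budget is met.
    lo, hi = 0.0, max(importance)  # caps(lo) overshoots, caps(hi) is all zeros
    for _ in range(iters):
        mid = (lo + hi) / 2
        if sum(caps(mid)) > total:
            lo = mid
        else:
            hi = mid
    return caps(hi)
```

For importances `[3.0, 2.5, 1.2, 1.0, 0.2]` and a budget of 3, the multiplier settles at 0.6, giving capacities of roughly `[1, 1, 0.6, 0.4, 0]`: the two most important features are represented fully, the middle two share capacity in order of importance, and the least important is dropped. The qualitative pattern matches the claim above, but the specific numbers come only from this assumed objective.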
