1 code implementation • 21 Dec 2023 • Kellin Pelrine, Mohammad Taufeeque, Michał Zając, Euan McLean, Adam Gleave
Language model attacks typically assume one of two extreme threat models: full white-box access to model weights, or black-box access limited to a text generation API.
1 code implementation • 26 Oct 2023 • Alex Tamkin, Mohammad Taufeeque, Noah D. Goodman
In this setting, our approach overcomes the superposition problem by assigning states to distinct codes, and we find that activating the code for a given state makes the neural network behave as if it were in that state.
2 code implementations • 22 Nov 2022 • Adam Gleave, Mohammad Taufeeque, Juan Rocamonde, Erik Jenner, Steven H. Wang, Sam Toyer, Maximilian Ernestus, Nora Belrose, Scott Emmons, Stuart Russell
imitation provides open-source implementations of imitation and reward learning algorithms in PyTorch.