Search Results for author: Rusheb Shah

Found 3 papers, 2 papers with code

Structured World Representations in Maze-Solving Transformers

1 code implementation • 5 Dec 2023 • Michael Igorevich Ivanitskiy, Alex F. Spies, Tilman Räuker, Guillaume Corlouer, Chris Mathwin, Lucia Quirke, Can Rager, Rusheb Shah, Dan Valentine, Cecilia Diniz Behn, Katsumi Inoue, Samy Wu Fung

Transformer models underpin many recent advances in practical machine learning applications, yet understanding their internal behavior continues to elude researchers.

Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation

no code implementations • 6 Nov 2023 • Rusheb Shah, Quentin Feuillade--Montixi, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando

Despite efforts to align large language models to produce harmless responses, they are still vulnerable to jailbreak prompts that elicit unrestricted behaviour.

Language Modelling
