no code implementations • 1 Sep 2023 • Varshini Subhash, Anna Bialas, Weiwei Pan, Finale Doshi-Velez
We believe this new geometric perspective on the mechanism driving universal attacks could yield deeper insight into the internal workings and failure modes of LLMs, thereby enabling the mitigation of these failures.
no code implementations • 5 Jan 2023 • Varshini Subhash
Pretrained large language models (LLMs) are becoming increasingly powerful and ubiquitous in mainstream applications, such as personal assistants and dialogue systems.
no code implementations • 10 Nov 2022 • Zixi Chen, Varshini Subhash, Marton Havasi, Weiwei Pan, Finale Doshi-Velez
In this work, we survey the properties defined in interpretable machine learning papers, synthesize them according to what they actually measure, and describe the trade-offs between different formulations of these properties.