no code implementations • 1 Sep 2023 • Varshini Subhash, Anna Bialas, Weiwei Pan, Finale Doshi-Velez
We believe this new geometric perspective on the mechanism driving universal attacks could yield deeper insight into the internal workings and failure modes of LLMs, thereby enabling the mitigation of these failures.
no code implementations • 5 Jan 2023 • Varshini Subhash
Pretrained large language models (LLMs) are becoming increasingly powerful and ubiquitous in mainstream applications, such as personal assistants and dialogue systems.
no code implementations • 10 Nov 2022 • Zixi Chen, Varshini Subhash, Marton Havasi, Weiwei Pan, Finale Doshi-Velez
In this work, we survey the properties defined in interpretable machine learning papers, synthesize them according to what they actually measure, and describe the trade-offs between different formulations of these properties.