Search Results for author: Varshini Subhash

Found 3 papers, 0 papers with code

Why do universal adversarial attacks work on large language models?: Geometry might be the answer

no code implementations • 1 Sep 2023 • Varshini Subhash, Anna Bialas, Weiwei Pan, Finale Doshi-Velez

We believe this new geometric perspective on the underlying mechanism driving universal attacks could yield deeper insight into the internal workings and failure modes of LLMs, and thus enable the mitigation of such attacks.

Dimensionality Reduction
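
The geometric framing above pairs naturally with dimensionality-reduction tooling. Below is a minimal, hypothetical sketch (not the authors' code) of the kind of analysis such a study might use: projecting hidden-state activations to 2-D with PCA to see whether adversarially-suffixed prompts shift along a shared direction. The data here is synthetic and the dimensionality is an assumption; a real analysis would extract activations from an actual model.

```python
# Sketch: compare (synthetic) benign vs. attacked LLM hidden states via PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
d = 768  # hypothetical hidden-state dimensionality

# Synthetic stand-ins: benign activations are isotropic noise, while
# "attacked" activations share one common offset direction, mimicking the
# hypothesis that universal attacks push activations along a shared
# low-dimensional direction.
benign = rng.normal(0.0, 1.0, size=(50, d))
attack_dir = rng.normal(0.0, 1.0, size=d)
attacked = benign + 4.0 * attack_dir / np.linalg.norm(attack_dir)

# Fit PCA on the pooled activations and inspect the 2-D projection:
# the two groups should separate along the leading component.
pca = PCA(n_components=2)
proj = pca.fit_transform(np.vstack([benign, attacked]))
print("explained variance ratio:", pca.explained_variance_ratio_)
print("benign mean (2-D):  ", proj[:50].mean(axis=0))
print("attacked mean (2-D):", proj[50:].mean(axis=0))
```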

Can Large Language Models Change User Preference Adversarially?

no code implementations • 5 Jan 2023 • Varshini Subhash

Pretrained large language models (LLMs) are becoming increasingly powerful and ubiquitous in mainstream applications such as personal assistants and dialogue systems.

What Makes a Good Explanation?: A Harmonized View of Properties of Explanations

no code implementations • 10 Nov 2022 • Zixi Chen, Varshini Subhash, Marton Havasi, Weiwei Pan, Finale Doshi-Velez

In this work, we survey properties defined in interpretable machine learning papers, synthesize them based on what they actually measure, and describe the trade-offs between different formulations of these properties.

Interpretable Machine Learning
