1 code implementation • 26 Feb 2024 • Yihan Wang, Zhouxing Shi, Andrew Bai, Cho-Jui Hsieh
The inferred prompt, called the backtranslated prompt, tends to reveal the actual intent of the original prompt, since it is generated from the LLM's response and is not directly manipulated by the attacker.
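As a rough illustration of the idea, the sketch below wraps a generic text-generation callable in a backtranslation check; `llm`, `is_refusal`, and the prompt wording are hypothetical placeholders, not the paper's implementation.

```python
# A minimal sketch of a backtranslation-style defense, assuming a generic
# `llm(prompt) -> str` callable and an `is_refusal(text) -> bool` check.
# Both names are illustrative placeholders, not the paper's API.

def backtranslate(response: str, llm) -> str:
    """Infer a prompt that would plausibly elicit `response`."""
    return llm(
        "Please guess the user's request that the following reply answers.\n"
        f"Reply: {response}\nGuessed request:"
    )

def defended_generate(prompt: str, llm, is_refusal) -> str:
    response = llm(prompt)
    # The backtranslated prompt is derived from the response alone,
    # so the attacker cannot manipulate it directly.
    inferred_prompt = backtranslate(response, llm)
    if is_refusal(llm(inferred_prompt)):
        # The model refuses the inferred intent: treat the original as harmful.
        return "I'm sorry, but I cannot help with that request."
    return response
```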
no code implementations • 12 Feb 2024 • Andrew Bai, Chih-Kuan Yeh, Cho-Jui Hsieh, Ankur Taly
We propose a novel sampling scheme, mix-cd, that identifies and prioritizes the samples that actually suffer forgetting, which we call collateral damage.
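One way to realize this kind of prioritization is sketched below, assuming classifiers `pretrained` and `finetuned` and a labeled batch of pretraining data; the names and the simple random fallback are illustrative, not the paper's mix-cd implementation.

```python
import torch

def collateral_damage_mask(pretrained, finetuned, x, y):
    """Samples the pretrained model got right but the fine-tuned model now misses."""
    with torch.no_grad():
        pre_correct = pretrained(x).argmax(dim=-1) == y
        fin_correct = finetuned(x).argmax(dim=-1) == y
    return pre_correct & ~fin_correct

def prioritized_replay_batch(pretrained, finetuned, x, y, k):
    """Keep up to k collateral-damage samples; pad with random batch
    indices (possibly overlapping) if fewer than k are damaged."""
    mask = collateral_damage_mask(pretrained, finetuned, x, y)
    idx = torch.nonzero(mask, as_tuple=False).squeeze(-1)
    if idx.numel() < k:
        extra = torch.randperm(x.size(0))[: k - idx.numel()]
        idx = torch.cat([idx, extra])
    return x[idx[:k]], y[idx[:k]]
```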
no code implementations • 17 Jan 2024 • Tong Xie, Haoyu Li, Andrew Bai, Cho-Jui Hsieh
Data attribution methods trace model behavior back to its training dataset, offering an effective approach to better understand "black-box" neural networks.
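For intuition, a common gradient-similarity estimator (TracIn-style) approximates a training example's influence on a test prediction by the dot product of their loss gradients; the sketch below illustrates that generic idea, not this paper's specific estimator.

```python
import torch

def flat_grad(model, loss):
    """Flatten the gradient of a scalar loss over all trainable parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def attribution_score(model, loss_fn, x_train, y_train, x_test, y_test):
    """Influence of a training example on a test prediction, approximated
    as the similarity of their loss gradients at the current parameters."""
    g_train = flat_grad(model, loss_fn(model(x_train), y_train))
    g_test = flat_grad(model, loss_fn(model(x_test), y_test))
    return torch.dot(g_train, g_test).item()
```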
1 code implementation • 21 Oct 2022 • Andrew Bai, Cho-Jui Hsieh, Wendy Kan, Hsuan-Tien Lin
In this paper, we propose memorization rejection, a training scheme that rejects, during training, generated samples that are near-duplicates of training samples.
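A minimal sketch of the rejection step, assuming near-duplicates are detected by nearest-neighbor distance in some feature space; `gen_feats`, `train_feats`, and the threshold `tau` are illustrative placeholders, not the paper's exact criterion.

```python
import torch

def reject_near_duplicates(gen_feats, train_feats, tau):
    """Mask out generated samples whose nearest training sample is within tau."""
    d = torch.cdist(gen_feats, train_feats)   # pairwise distances, (n_gen, n_train)
    nearest = d.min(dim=1).values             # distance to the closest training sample
    return nearest > tau                      # True = keep, False = reject

# Usage inside a generator update: drop rejected samples from the loss.
# keep = reject_near_duplicates(feat(g_out), train_feats, tau=0.1)
# loss = adversarial_loss(d_out[keep])
```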
1 code implementation • 31 Aug 2022 • Andrew Bai, Chih-Kuan Yeh, Pradeep Ravikumar, Neil Y. C. Lin, Cho-Jui Hsieh
We showed that for a general (potentially non-linear) concept, we can mathematically evaluate how a small change in the concept affects the model's prediction, which extends gradient-based interpretation to the concept space.
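A minimal sketch of this chain-rule construction, assuming a scalar predictor `f` and a concept model `g` mapping inputs to concept values; composing the input gradient of `f` with the pseudo-inverse of the Jacobian of `g` is one way to realize the idea, not necessarily the paper's exact formulation.

```python
import torch

def concept_gradient(f, g, x):
    """Approximate df/dc by chaining df/dx through the pseudo-inverse of the
    concept Jacobian dg/dx, where c = g(x). `f` must return a scalar and
    `g` a 1-D tensor of k concept values for a 1-D input x of dimension d."""
    x = x.detach().requires_grad_(True)
    grad_f = torch.autograd.grad(f(x), x)[0]           # (d,) input gradient of f
    jac_g = torch.autograd.functional.jacobian(g, x)   # (k, d) concept Jacobian
    # Sensitivity of the prediction to a small change in each concept.
    return torch.linalg.pinv(jac_g).T @ grad_f         # (k,)
```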