Search Results for author: Jiongxiao Wang

Found 10 papers, 4 papers with code

Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment

no code implementations22 Feb 2024 Jiongxiao Wang, Jiazhao Li, Yiquan Li, Xiangyu Qi, Junjie Hu, Yixuan Li, Patrick McDaniel, Muhao Chen, Bo Li, Chaowei Xiao

Despite the general capabilities of Large Language Models (LLMs) like GPT-4 and Llama-2, these models still request fine-tuning or adaptation with customized data when it comes to meeting the specific business demands and intricacies of tailored use cases.

Preference Poisoning Attacks on Reward Model Learning

no code implementations2 Feb 2024 Junlin Wu, Jiongxiao Wang, Chaowei Xiao, Chenguang Wang, Ning Zhang, Yevgeniy Vorobeychik

In addition, we observe that the simpler and more scalable rank-by-distance approaches are often competitive with the best, and on occasion significantly outperform gradient-based methods.

Test-time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations

no code implementations16 Nov 2023 Wenjie Mo, Jiashu Xu, Qin Liu, Jiongxiao Wang, Jun Yan, Chaowei Xiao, Muhao Chen

Existing studies in backdoor defense have predominantly focused on the training phase, overlooking the critical aspect of testing time defense.

backdoor defense

On the Exploitability of Reinforcement Learning with Human Feedback for Large Language Models

no code implementations16 Nov 2023 Jiongxiao Wang, Junlin Wu, Muhao Chen, Yevgeniy Vorobeychik, Chaowei Xiao

Reinforcement Learning with Human Feedback (RLHF) is a methodology designed to align Large Language Models (LLMs) with human preferences, playing an important role in LLMs alignment.

Backdoor Attack Data Poisoning

On the Exploitability of Instruction Tuning

1 code implementation NeurIPS 2023 Manli Shu, Jiongxiao Wang, Chen Zhu, Jonas Geiping, Chaowei Xiao, Tom Goldstein

In this work, we investigate how an adversary can exploit instruction tuning by injecting specific instruction-following examples into the training data that intentionally changes the model's behavior.

Data Poisoning Instruction Following

Adversarial Demonstration Attacks on Large Language Models

no code implementations24 May 2023 Jiongxiao Wang, Zichen Liu, Keun Hee Park, Zhuojun Jiang, Zhaoheng Zheng, Zhuofeng Wu, Muhao Chen, Chaowei Xiao

We propose a novel attack method named advICL, which aims to manipulate only the demonstration without changing the input to mislead the models.

In-Context Learning

Defending against Adversarial Audio via Diffusion Model

1 code implementation2 Mar 2023 Shutong Wu, Jiongxiao Wang, Wei Ping, Weili Nie, Chaowei Xiao

In this paper, we propose an adversarial purification-based defense pipeline, AudioPure, for acoustic systems via off-the-shelf diffusion models.

DensePure: Understanding Diffusion Models towards Adversarial Robustness

no code implementations1 Nov 2022 Chaowei Xiao, Zhongzhu Chen, Kun Jin, Jiongxiao Wang, Weili Nie, Mingyan Liu, Anima Anandkumar, Bo Li, Dawn Song

By using the highest density point in the conditional distribution as the reversed sample, we identify the robust region of a given instance under the diffusion model's reverse process.

Adversarial Robustness Denoising

Fast and Reliable Evaluation of Adversarial Robustness with Minimum-Margin Attack

1 code implementation15 Jun 2022 Ruize Gao, Jiongxiao Wang, Kaiwen Zhou, Feng Liu, Binghui Xie, Gang Niu, Bo Han, James Cheng

The AutoAttack (AA) has been the most reliable method to evaluate adversarial robustness when considerable computational resources are available.

Adversarial Robustness Computational Efficiency

Cannot find the paper you are looking for? You can Submit a new open access paper.