Search Results for author: Jiongxiao Wang

Found 10 papers, 4 papers with code

Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment

no code implementations • 22 Feb 2024 • Jiongxiao Wang, Jiazhao Li, Yiquan Li, Xiangyu Qi, Junjie Hu, Yixuan Li, Patrick McDaniel, Muhao Chen, Bo Li, Chaowei Xiao

Despite the general capabilities of Large Language Models (LLMs) like GPT-4 and Llama-2, these models still request fine-tuning or adaptation with customized data when it comes to meeting the specific business demands and intricacies of tailored use cases.

Paper
Add Code

Preference Poisoning Attacks on Reward Model Learning

no code implementations • 2 Feb 2024 • Junlin Wu, Jiongxiao Wang, Chaowei Xiao, Chenguang Wang, Ning Zhang, Yevgeniy Vorobeychik

In addition, we observe that the simpler and more scalable rank-by-distance approaches are often competitive with the best, and on occasion significantly outperform gradient-based methods.

Paper
Add Code

Test-time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations

no code implementations • 16 Nov 2023 • Wenjie Mo, Jiashu Xu, Qin Liu, Jiongxiao Wang, Jun Yan, Chaowei Xiao, Muhao Chen

Existing studies in backdoor defense have predominantly focused on the training phase, overlooking the critical aspect of testing time defense.

backdoor defense

Paper
Add Code

On the Exploitability of Reinforcement Learning with Human Feedback for Large Language Models

no code implementations • 16 Nov 2023 • Jiongxiao Wang, Junlin Wu, Muhao Chen, Yevgeniy Vorobeychik, Chaowei Xiao

Reinforcement Learning with Human Feedback (RLHF) is a methodology designed to align Large Language Models (LLMs) with human preferences, playing an important role in LLMs alignment.

Backdoor Attack Data Poisoning

Paper
Add Code

On the Exploitability of Instruction Tuning

1 code implementation • NeurIPS 2023 • Manli Shu, Jiongxiao Wang, Chen Zhu, Jonas Geiping, Chaowei Xiao, Tom Goldstein

In this work, we investigate how an adversary can exploit instruction tuning by injecting specific instruction-following examples into the training data that intentionally changes the model's behavior.

Data Poisoning Instruction Following

Paper
Code

ChatGPT-powered Conversational Drug Editing Using Retrieval and Domain Feedback

1 code implementation • 29 May 2023 • Shengchao Liu, Jiongxiao Wang, Yijin Yang, Chengpeng Wang, Ling Liu, Hongyu Guo, Chaowei Xiao

This research sheds light on the potential of ChatGPT and conversational LLMs for drug editing.

Decision Making Drug Discovery +3

121

Paper
Code

Adversarial Demonstration Attacks on Large Language Models

no code implementations • 24 May 2023 • Jiongxiao Wang, Zichen Liu, Keun Hee Park, Zhuojun Jiang, Zhaoheng Zheng, Zhuofeng Wu, Muhao Chen, Chaowei Xiao

We propose a novel attack method named advICL, which aims to manipulate only the demonstration without changing the input to mislead the models.

In-Context Learning

Paper
Add Code

Defending against Adversarial Audio via Diffusion Model

1 code implementation • 2 Mar 2023 • Shutong Wu, Jiongxiao Wang, Wei Ping, Weili Nie, Chaowei Xiao

In this paper, we propose an adversarial purification-based defense pipeline, AudioPure, for acoustic systems via off-the-shelf diffusion models.

Paper
Code

DensePure: Understanding Diffusion Models towards Adversarial Robustness

no code implementations • 1 Nov 2022 • Chaowei Xiao, Zhongzhu Chen, Kun Jin, Jiongxiao Wang, Weili Nie, Mingyan Liu, Anima Anandkumar, Bo Li, Dawn Song

By using the highest density point in the conditional distribution as the reversed sample, we identify the robust region of a given instance under the diffusion model's reverse process.

Adversarial Robustness Denoising

Paper
Add Code

Fast and Reliable Evaluation of Adversarial Robustness with Minimum-Margin Attack

1 code implementation • 15 Jun 2022 • Ruize Gao, Jiongxiao Wang, Kaiwen Zhou, Feng Liu, Binghui Xie, Gang Niu, Bo Han, James Cheng

The AutoAttack (AA) has been the most reliable method to evaluate adversarial robustness when considerable computational resources are available.

Adversarial Robustness Computational Efficiency

Paper
Code

Cannot find the paper you are looking for? You can Submit a new open access paper.