1 code implementation • 6 Nov 2023 • Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, Bo Han
Despite remarkable success in various applications, large language models (LLMs) are vulnerable to adversarial jailbreaks that nullify their safety guardrails.
1 code implementation • 6 Jun 2023 • Jianing Zhu, Xiawei Guo, Jiangchao Yao, Chao Du, Li He, Shuo Yuan, Tongliang Liu, Liang Wang, Bo Han
In this paper, we dive into the perspective of model dynamics and propose a novel information measure, namely, Memorization Discrepancy, to explore the defense via the model-level information.
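The snippet above does not spell the measure out; purely as a rough illustration, the sketch below assumes Memorization Discrepancy is the KL divergence between the predictive distributions of two model snapshots on the same batch. The function name and usage are illustrative, not the paper's API.

```python
import torch
import torch.nn.functional as F

def memorization_discrepancy(model_before, model_after, x):
    """Hypothetical sketch: KL divergence between the predictions of two
    model snapshots on the same batch x (larger value = the update changed
    how the model treats x more)."""
    with torch.no_grad():
        log_p_before = F.log_softmax(model_before(x), dim=1)
        p_after = F.softmax(model_after(x), dim=1)
    # KL(p_after || p_before), averaged over the batch.
    return F.kl_div(log_p_before, p_after, reduction="batchmean")
```

A batch whose discrepancy spikes relative to clean reference data could then be flagged as potentially poisoned.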
1 code implementation • 6 Jun 2023 • Jianing Zhu, Hengzhuang Li, Jiangchao Yao, Tongliang Liu, Jianliang Xu, Bo Han
Based on such insights, we propose a novel method, Unleashing Mask, which aims to restore the OOD discriminative capabilities of the well-trained model with ID data.
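As a minimal sketch of the idea (assuming the mask simply filters out atypical, high-loss ID samples before fine-tuning; the paper's actual masking criterion may differ):

```python
import torch
import torch.nn.functional as F

def loss_based_mask(model, loader, threshold):
    """Hypothetical sketch: mark ID samples whose loss under the
    well-trained model exceeds a threshold as 'atypical' so they can be
    excluded (or forgotten) during fine-tuning."""
    keep = []
    model.eval()
    with torch.no_grad():
        for x, y in loader:
            per_sample_loss = F.cross_entropy(model(x), y, reduction="none")
            keep.append(per_sample_loss <= threshold)  # True = typical, kept
    return torch.cat(keep)
```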
1 code implementation • 1 Mar 2023 • Jianing Zhu, Jiangchao Yao, Tongliang Liu, Quanming Yao, Jianliang Xu, Bo Han
Privacy and security concerns in real-world applications have led to the development of adversarially robust federated models.
1 code implementation • 1 Nov 2022 • Jianan Zhou, Jianing Zhu, Jingfeng Zhang, Tongliang Liu, Gang Niu, Bo Han, Masashi Sugiyama
Adversarial training (AT) with imperfect supervision is significant but receives limited attention.
no code implementations • 29 Sep 2021 • Jianing Zhu, Jiangchao Yao, Tongliang Liu, Kunyang Jia, Jingren Zhou, Bo Han, Hongxia Yang
Federated Adversarial Training (FAT) helps address data privacy and governance issues while maintaining model robustness against adversarial attacks.
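As a generic sketch of the FAT setting (not this paper's specific algorithm): each client runs local adversarial training, and the server averages the resulting weights, FedAvg-style. Here `make_adv` is an assumed helper that crafts adversarial examples, e.g. via a few PGD steps.

```python
import copy
import torch
import torch.nn.functional as F

def fat_round(global_model, clients, make_adv, local_epochs=1):
    """Generic sketch of one Federated Adversarial Training round:
    local adversarial training on each client, then FedAvg aggregation.
    `clients` yields (loader, optimizer_fn) pairs; illustrative only."""
    states = []
    for loader, optimizer_fn in clients:
        local = copy.deepcopy(global_model)
        opt = optimizer_fn(local.parameters())
        for _ in range(local_epochs):
            for x, y in loader:
                x_adv = make_adv(local, x, y)  # client-side attack
                loss = F.cross_entropy(local(x_adv), y)
                opt.zero_grad()
                loss.backward()
                opt.step()
        states.append(local.state_dict())
    # FedAvg: parameter-wise mean over the client models.
    avg = {k: torch.stack([s[k].float() for s in states]).mean(0)
           for k in states[0]}
    global_model.load_state_dict(avg)
```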
2 code implementations • ICLR 2022 • Jianing Zhu, Jiangchao Yao, Bo Han, Jingfeng Zhang, Tongliang Liu, Gang Niu, Jingren Zhou, Jianliang Xu, Hongxia Yang
However, when considering adversarial robustness, teachers may become unreliable, and adversarial distillation may not work: teachers are pretrained on their own adversarial data, and it is too demanding to require that they also perform well on every adversarial example queried by students.
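One way to make that concrete (a hedged sketch, not the paper's actual loss): trust the teacher on a student-crafted adversarial example only to the extent the teacher still classifies it correctly, and fall back to the hard label otherwise.

```python
import torch
import torch.nn.functional as F

def partially_trusted_distillation_loss(student_logits, teacher_logits, y):
    """Hypothetical sketch: per-example trust = teacher's confidence in
    the true class on the student's adversarial example; interpolate
    between distillation (KL to teacher) and plain cross-entropy."""
    with torch.no_grad():
        teacher_probs = F.softmax(teacher_logits, dim=1)
        trust = teacher_probs.gather(1, y.unsqueeze(1)).squeeze(1)
    kd = F.kl_div(F.log_softmax(student_logits, dim=1), teacher_probs,
                  reduction="none").sum(dim=1)
    ce = F.cross_entropy(student_logits, y, reduction="none")
    return (trust * kd + (1.0 - trust) * ce).mean()
```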
no code implementations • 6 Feb 2021 • Jianing Zhu, Jingfeng Zhang, Bo Han, Tongliang Liu, Gang Niu, Hongxia Yang, Mohan Kankanhalli, Masashi Sugiyama
A recent adversarial training (AT) study showed that the number of projected gradient descent (PGD) steps needed to successfully attack a point (i.e., find an adversarial example in its proximity) is an effective measure of that point's robustness.
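That measure is straightforward to sketch, assuming a single example with pixels in [0, 1] and an L∞ ball (the hyperparameter defaults below are the usual CIFAR-style choices, picked here for illustration):

```python
import torch
import torch.nn.functional as F

def pgd_steps_to_flip(model, x, y, eps=8/255, alpha=2/255, max_steps=10):
    """Sketch: count PGD steps until the model's prediction first flips
    away from the true label y (batch of size 1). Fewer steps = the
    point is easier to attack, i.e., less robust."""
    x_adv = x.clone().detach()
    for step in range(1, max_steps + 1):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # project into eps-ball
            x_adv = x_adv.clamp(0, 1)
            if model(x_adv).argmax(dim=1).item() != y.item():
                return step  # attack succeeded at this step
    return max_steps  # never flipped within the budget
```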
2 code implementations • ICLR 2021 • Jingfeng Zhang, Jianing Zhu, Gang Niu, Bo Han, Masashi Sugiyama, Mohan Kankanhalli
The long-standing belief that robustness and accuracy hurt each other was challenged by recent studies showing that robustness can be maintained while accuracy improves.
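Purely as an illustration of instance-reweighted adversarial training (the specific weighting rule below, including the tanh shaping and λ, is hypothetical): points that need fewer PGD steps to attack receive larger weight in the training loss.

```python
import torch

def reweighted_at_loss(per_example_loss, kappa, total_steps, lam=0.0):
    """Hypothetical sketch: upweight easily attacked points. `kappa` is a
    float tensor of per-example PGD steps-to-flip (see the sketch above);
    `total_steps` is the PGD step budget."""
    w = (1 + torch.tanh(lam + 5 * (1 - 2 * kappa / total_steps))) / 2
    w = w / w.sum()  # normalize weights over the batch
    return (w * per_example_loss).sum()
```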