no code implementations • 29 Apr 2024 • Guoliang Dong, Haoyu Wang, Jun Sun, Xinyu Wang
The results show that LLMs exhibit stronger human alignment capabilities with queries in English, French, Russian, and Spanish (only 1.04% of harmful queries successfully jailbreak on average) compared to queries in Bengali, Georgian, Nepali and Maithili (27.7% of harmful queries successfully jailbreak on average).
no code implementations • 29 Dec 2021 • Guoliang Dong, Jingyi Wang, Jun Sun, Sudipta Chattopadhyay, Xinyu Wang, Ting Dai, Jie Shi, Jin Song Dong
Furthermore, such attacks are impossible to eliminate, i.e., adversarial perturbations remain possible even after applying mitigation methods such as adversarial training.
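To make the notion of an adversarial perturbation concrete, here is a minimal FGSM-style sketch in PyTorch. This is a generic illustration rather than the attack studied in the paper; `model`, `x`, and `y` are assumed to be a trained classifier, an input batch, and its ground-truth labels.

```python
import torch

def fgsm_perturb(model, x, y, epsilon=0.03):
    """Craft a one-step FGSM adversarial example (generic sketch, not the paper's method)."""
    x = x.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    # Step in the direction that increases the loss, then clamp to a valid pixel range.
    x_adv = (x + epsilon * x.grad.sign()).clamp(0.0, 1.0)
    return x_adv.detach()
```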
no code implementations • 17 Jul 2021 • Peixin Zhang, Jingyi Wang, Jun Sun, Xinyu Wang, Guoliang Dong, Xingen Wang, Ting Dai, Jin Song Dong
In this work, we bridge the gap by proposing a scalable and effective approach for systematically searching for discriminatory samples while extending existing fairness testing approaches to address a more challenging domain, i.e., text classification.
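For intuition, a discriminatory sample in text classification can be thought of as a pair of inputs that differ only in a protected-attribute term yet receive different predictions. The sketch below illustrates that check only; it is not the paper's search algorithm, and `classify` is a hypothetical black-box text classifier.

```python
def is_discriminatory(text: str, term_a: str, term_b: str, classify) -> bool:
    """Return True if swapping term_a for term_b flips the predicted label."""
    if term_a not in text:
        return False
    counterfactual = text.replace(term_a, term_b)
    return classify(text) != classify(counterfactual)

# Hypothetical usage:
# is_discriminatory("The applicant from Germany was hired.", "Germany", "Nigeria", classify)
```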
no code implementations • 3 Dec 2020 • Guoliang Dong, Jun Sun, Jingyi Wang, Xinyu Wang, Ting Dai
Neural networks are increasingly applied to support decision making in safety-critical applications (such as autonomous cars, unmanned aerial vehicles, and face-recognition-based authentication).
1 code implementation • 22 Sep 2019 • Guoliang Dong, Jingyi Wang, Jun Sun, Yang Zhang, Xinyu Wang, Ting Dai, Jin Song Dong, Xingen Wang
In this work, we propose an approach to extract probabilistic automata for interpreting an important class of neural networks, i.e., recurrent neural networks.
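One common way to realize this idea is to abstract the RNN's hidden states into a finite set of clusters and estimate transition probabilities from state traces. The sketch below follows that generic recipe and is not necessarily the paper's exact construction; `hidden_traces` is assumed to hold one hidden-state sequence (shape `(sequence_length, hidden_dim)`) per input.

```python
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

def extract_probabilistic_automaton(hidden_traces, n_states=10, seed=0):
    """Cluster RNN hidden states into abstract states and count transitions between them."""
    all_states = np.concatenate(hidden_traces, axis=0)
    kmeans = KMeans(n_clusters=n_states, random_state=seed).fit(all_states)

    # Count transitions between abstract states along each trace.
    counts = defaultdict(lambda: defaultdict(int))
    for trace in hidden_traces:
        labels = kmeans.predict(trace)
        for src, dst in zip(labels[:-1], labels[1:]):
            counts[src][dst] += 1

    # Normalize counts into transition probabilities.
    transitions = {
        src: {dst: c / sum(dsts.values()) for dst, c in dsts.items()}
        for src, dsts in counts.items()
    }
    return kmeans, transitions
```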
5 code implementations • 14 Dec 2018 • Jingyi Wang, Guoliang Dong, Jun Sun, Xinyu Wang, Peixin Zhang
We thus first propose a measure of 'sensitivity' and show empirically that normal samples and adversarial samples have distinguishable sensitivity.
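As a rough illustration, one way to operationalize such a sensitivity score is the rate at which a model's predicted label flips under small random input perturbations; the paper's actual measure may differ, and `model` and `x` below are an assumed trained classifier and input tensor.

```python
import torch

def label_change_rate(model, x, n_trials=50, noise_std=0.01):
    """Fraction of small random perturbations that change the predicted label
    (an assumed proxy for 'sensitivity', not necessarily the paper's measure)."""
    model.eval()
    with torch.no_grad():
        base_label = model(x).argmax(dim=-1)
        changes = 0
        for _ in range(n_trials):
            perturbed = x + noise_std * torch.randn_like(x)
            if not torch.equal(model(perturbed).argmax(dim=-1), base_label):
                changes += 1
    return changes / n_trials
```

Under this reading, adversarial samples sit close to decision boundaries and would therefore exhibit higher label-change rates than normal samples.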