Balanced Adversarial Training: Balancing Tradeoffs Between Oversensitivity and Undersensitivity in NLP Models

ACL ARR January 2022 · Anonymous

Traditional (\emph{oversensitive}) adversarial examples involve finding a small perturbation that does not change an input's true label but confuses the classifier into outputting a different prediction. \emph{Undersensitive} adversarial examples are the opposite---the adversary's goal is to find a small perturbation that changes the true label of an input while preserving the classifier's prediction. Adversarial training and certified robust training have shown some effectiveness in improving the robustness of machine-learned models to oversensitive adversarial examples. However, recent work has shown that using these techniques to improve robustness for image classifiers may make a model more vulnerable to undersensitive adversarial examples. We demonstrate the same phenomenon applies to NLP models, showing that training methods that improve robustness to synonym-based attacks (oversensitive adversarial examples) tend to increase a model's vulnerability to antonym-based attacks (undersensitive adversarial examples) for both natural language inference and paraphrase identification tasks. To counter this phenomenon, we introduce \textit{Balanced Adversarial Training} which incorporates contrastive learning to increase robustness against both over- and undersensitive adversarial examples.
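The abstract does not spell out the training objective, so the following is only a minimal PyTorch sketch of how contrastive learning could couple the two robustness goals: a synonym-perturbed (label-preserving) view of each input is treated as the positive, an antonym-perturbed (label-changing) view as the negative, and this contrastive term is added to the ordinary classification loss. The names `encoder`, `classifier`, `synonym_perturb`, `antonym_perturb`, and the weight `alpha` are hypothetical placeholders, not identifiers taken from the paper.

```python
# Hedged sketch of a balanced adversarial training step (not the paper's exact method).
# Assumptions: `encoder` maps a batch of inputs to sentence embeddings,
# `classifier` maps embeddings to task logits (e.g., NLI labels), and
# `synonym_perturb` / `antonym_perturb` are hypothetical helpers returning
# label-preserving / label-changing perturbations of the batch.
import torch
import torch.nn.functional as F


def balanced_contrastive_loss(z_orig, z_syn, z_ant, temperature=0.1):
    """InfoNCE-style loss: the synonym view is the positive, the antonym view the negative."""
    z_orig = F.normalize(z_orig, dim=-1)
    z_syn = F.normalize(z_syn, dim=-1)
    z_ant = F.normalize(z_ant, dim=-1)
    pos = (z_orig * z_syn).sum(-1) / temperature   # cosine similarity to positive view
    neg = (z_orig * z_ant).sum(-1) / temperature   # cosine similarity to negative view
    logits = torch.stack([pos, neg], dim=-1)       # shape [batch, 2]
    targets = torch.zeros(z_orig.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)        # positive view sits at index 0


def training_step(encoder, classifier, batch, labels,
                  synonym_perturb, antonym_perturb, alpha=1.0):
    z_orig = encoder(batch)
    z_syn = encoder(synonym_perturb(batch))   # perturbation that keeps the true label
    z_ant = encoder(antonym_perturb(batch))   # perturbation that flips the true label
    task_loss = F.cross_entropy(classifier(z_orig), labels)
    contrastive = balanced_contrastive_loss(z_orig, z_syn, z_ant)
    return task_loss + alpha * contrastive    # alpha trades off the two objectives
```

Under these assumptions, pushing antonym-perturbed embeddings away from the original counteracts the undersensitivity that standard (synonym-only) adversarial training tends to induce, while the positive term preserves robustness to label-preserving perturbations.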
