Perturbation Type Categorization for Multiple $\ell_p$ Bounded Adversarial Robustness

1 Jan 2021 · Pratyush Maini, Xinyun Chen, Bo Li, Dawn Song

Despite the recent advances in $\textit{adversarial training}$ based defenses, deep neural networks are still vulnerable to adversarial attacks outside the perturbation type they are trained to be robust against. Recent works have proposed defenses to improve the robustness of a single model against the union of multiple perturbation types. However, when evaluated against each individual attack, these methods still suffer from significant trade-offs compared to models specifically trained to be robust against that perturbation type. In this work, we introduce the problem of categorizing adversarial examples based on their $\ell_p$ perturbation type. Based on our analysis, we propose $\textit{PROTECTOR}$, a two-stage pipeline to improve robustness against multiple perturbation types. Instead of training a single predictor, $\textit{PROTECTOR}$ first categorizes the perturbation type of the input, and then uses a predictor specifically trained against the predicted perturbation type to make the final prediction. We first show theoretically that adversarial examples created with different perturbation types constitute different distributions, which makes it possible to distinguish them. Further, we show that at test time the adversary faces a natural trade-off between fooling the perturbation type classifier and the succeeding predictor optimized with perturbation-specific adversarial training. This makes it challenging for an adversary to mount strong attacks against the whole pipeline. In addition, we demonstrate the realization of this trade-off in deep networks by adding random noise to the model input at test time, enabling enhanced robustness against strong adaptive attacks. Extensive experiments on MNIST and CIFAR-10 show that $\textit{PROTECTOR}$ outperforms prior adversarial training based defenses by over $5\%$ when tested against the union of $\ell_1$, $\ell_2$, and $\ell_\infty$ attacks.
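The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustrative sketch in PyTorch, not the authors' released implementation: the `Protector` class name, the `noise_std` parameter, and the per-sample routing logic are assumptions, given a perturbation-type classifier over $\{\ell_1, \ell_2, \ell_\infty\}$ and one adversarially trained predictor per type.

```python
import torch
import torch.nn as nn

class Protector(nn.Module):
    """Sketch of a two-stage pipeline in the spirit of PROTECTOR (hypothetical names).

    Stage 1: a perturbation-type classifier predicts which l_p ball the
    (possibly adversarial) input was crafted in, on a randomly perturbed
    copy of the input (the test-time noise mentioned in the abstract).
    Stage 2: the input is routed to the predictor adversarially trained
    against that perturbation type.
    """

    def __init__(self, type_classifier, predictors, noise_std=0.1):
        super().__init__()
        self.type_classifier = type_classifier       # logits over {l1, l2, linf}
        self.predictors = nn.ModuleList(predictors)  # one predictor per perturbation type
        self.noise_std = noise_std                   # test-time noise scale (assumed value)

    def forward(self, x):
        # Stage 1: categorize the perturbation type from a noisy copy of the input.
        noisy_x = x + self.noise_std * torch.randn_like(x)
        type_idx = self.type_classifier(noisy_x).argmax(dim=1)

        # Stage 2: route each sample to the matching perturbation-specific predictor.
        outputs = [self.predictors[i](xi.unsqueeze(0)).squeeze(0)
                   for i, xi in zip(type_idx.tolist(), x)]
        return torch.stack(outputs)
```

The key design point carried by the abstract is the trade-off this routing induces for an adaptive adversary: a perturbation strong enough to fool the type classifier tends to land on a predictor that was hardened against exactly that perturbation type.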
