Non-robust Features through the Lens of Universal Perturbations
Recent work ties adversarial perturbations to so-called non-robust features: features that are susceptible to small perturbations and believed to be incomprehensible to humans, yet still useful for (generalizable) prediction. We study universal adversarial perturbations and demonstrate that this picture is more nuanced. Specifically, although universal perturbations, like standard adversarial perturbations, do leverage non-robust features, these features tend to be fundamentally different from the “standard” ones and, in particular, non-trivially human-aligned. Namely, universal perturbations are more semantic and have human-aligned locality and spatial invariance properties. However, we also show that these semantic non-robust features carry much less predictive signal than general non-robust features, indicating that probing the structure of the bulk of non-robust features requires additional techniques. Our findings thus take a step towards understanding the nature of non-robust features.
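To make the term concrete, a universal perturbation is a single bounded vector added to every input, rather than one perturbation computed per input. The sketch below is a toy illustration only and is not the procedure used in the paper: it assumes a fixed logistic model on three-dimensional inputs, an L-infinity budget `eps`, and projected signed-gradient ascent on the mean log-probability of a target class. All names (`universal_perturbation`, `fraction_target`, the weights, and the data) are hypothetical.

```python
import math

def score(w, b, x):
    """Linear score of a toy logistic model; class 1 is predicted when score > 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def sign(g):
    return (g > 0) - (g < 0)

def universal_perturbation(w, b, xs, eps, step=0.05, iters=50):
    """Find ONE delta, shared by all inputs, that pushes them toward class 1.

    Assumed setup: signed-gradient ascent on the mean log-probability of
    class 1, followed by projection onto the L-infinity ball of radius eps.
    """
    d = len(w)
    delta = [0.0] * d
    for _ in range(iters):
        grad = [0.0] * d
        for x in xs:
            xp = [xi + di for xi, di in zip(x, delta)]
            p = 1.0 / (1.0 + math.exp(-score(w, b, xp)))
            # Gradient of log p with respect to delta is (1 - p) * w.
            for j in range(d):
                grad[j] += (1.0 - p) * w[j] / len(xs)
        # Signed step, then clip each coordinate back into [-eps, eps].
        delta = [max(-eps, min(eps, di + step * sign(g)))
                 for di, g in zip(delta, grad)]
    return delta

def fraction_target(w, b, xs, delta):
    """Fraction of inputs classified as class 1 after adding the shared delta."""
    hits = sum(score(w, b, [xi + di for xi, di in zip(x, delta)]) > 0 for x in xs)
    return hits / len(xs)

# Hypothetical model and data: every input starts on the class-0 side.
w, b, eps = [1.0, -2.0, 0.5], 0.0, 0.5
xs = [[0.0, 0.5, 0.0], [-1.0, 0.0, 0.0], [0.0, 0.25, -1.0],
      [-0.5, 0.5, 0.0], [0.0, 2.0, 0.0]]

delta = universal_perturbation(w, b, xs, eps)
rate_before = fraction_target(w, b, xs, [0.0] * len(w))
rate_after = fraction_target(w, b, xs, delta)
print(delta, rate_before, rate_after)
```

For a linear model the shared perturbation and the per-input one point in the same direction, so this toy only shows the "one delta for all inputs" constraint; the differences the abstract describes concern deep image classifiers.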