Large-Scale Adversarial Training for Vision-and-Language Representation Learning

We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning. VILLA consists of two training stages: (i) task-agnostic adversarial pre-training; followed by (ii) task-specific adversarial finetuning. Instead of adding adversarial perturbations on image pixels and textual tokens, we propose to perform adversarial training in the embedding space of each modality. To enable large-scale training, we adopt the "free" adversarial training strategy, and combine it with KL-divergence-based regularization to promote higher invariance in the embedding space. We apply VILLA to current best-performing V+L models, and achieve new state of the art on a wide range of tasks, including Visual Question Answering, Visual Commonsense Reasoning, Image-Text Retrieval, Referring Expression Comprehension, Visual Entailment, and NLVR2.
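To make the abstract's core idea concrete, here is a minimal sketch of one adversarial training step in the embedding space, with a KL term tying clean and adversarial predictions together. Everything in it is an illustrative assumption rather than the authors' released code: the `model(img_emb, txt_emb)` interface, the hyperparameters, the single explicit inner-maximization step (the "free" strategy reuses gradients across steps), the PGD-style sign update, perturbing only the text embeddings, and the one-directional KL.

```python
import torch
import torch.nn.functional as F

def villa_style_adv_step(model, img_emb, txt_emb, labels,
                         adv_lr=1e-3, eps=1e-2, alpha=1.0):
    """One VILLA-style adversarial step in the embedding space (sketch).

    Assumes `model(img_emb, txt_emb)` returns classification logits.
    All names and hyperparameters are hypothetical defaults.
    """
    # Clean forward pass and standard task loss.
    clean_logits = model(img_emb, txt_emb)
    clean_loss = F.cross_entropy(clean_logits, labels)

    # Small random perturbation on the text embeddings (the same idea
    # applies to the image-region embeddings, in a separate step).
    delta = torch.zeros_like(txt_emb).uniform_(-eps, eps).requires_grad_(True)

    # Inner maximization: one gradient-ascent step on the perturbation.
    # A PGD-style sign update is used here for simplicity.
    adv_loss = F.cross_entropy(model(img_emb, txt_emb + delta), labels)
    grad, = torch.autograd.grad(adv_loss, delta)
    delta = (delta + adv_lr * grad.sign()).clamp(-eps, eps).detach()

    # Adversarial forward pass with the updated perturbation.
    adv_logits = model(img_emb, txt_emb + delta)
    adv_loss = F.cross_entropy(adv_logits, labels)

    # KL regularizer pushing clean and adversarial predictions to agree,
    # promoting invariance in the embedding space (one direction shown;
    # a symmetric KL is a natural variant).
    kl = F.kl_div(F.log_softmax(adv_logits, dim=-1),
                  F.softmax(clean_logits, dim=-1),
                  reduction="batchmean")

    return clean_loss + adv_loss + alpha * kl
```

The returned scalar can be passed straight to `loss.backward()` inside an ordinary training loop; the perturbation lives entirely in the embedding space, so no pixel- or token-level attack machinery is needed.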

Results from the Paper


Ranked #7 on Visual Entailment on SNLI-VE val (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Referring Expression Comprehension | RefCOCO+ | VILLA-large | Val | 76.17 | #8 |
| Referring Expression Comprehension | RefCOCO+ | VILLA-large | Test A | 81.54 | #8 |
| Referring Expression Comprehension | RefCOCO+ | VILLA-large | Test B | 66.84 | #8 |
| Referring Expression Comprehension | RefCOCO | VILLA-large | Val | 82.39 | #12 |
| Referring Expression Comprehension | RefCOCO | VILLA-large | Test A | 87.48 | #11 |
| Referring Expression Comprehension | RefCOCO | VILLA-large | Test B | 74.84 | #13 |
| Referring Expression Comprehension | RefCOCOg-test | VILLA-large | Accuracy | 76.71 | #8 |
| Referring Expression Comprehension | RefCOCOg-val | VILLA-large | Accuracy | 76.18 | #9 |
| Visual Entailment | SNLI-VE val | VILLA-large | Accuracy | 80.18 | #7 |
