UNITER: UNiversal Image-TExt Representation Learning

Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodal inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Unlike previous work, which applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of the image/text). In addition to ITM for global image-text alignment, we also propose WRA via the use of Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions during pre-training. Comprehensive analysis shows that both conditional masking and OT-based WRA contribute to better pre-training. We also conduct a thorough ablation study to find an optimal combination of pre-training tasks. Extensive experiments show that UNITER achieves new state of the art across six V+L tasks (over nine datasets), including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR$^2$. Code is available at https://github.com/ChenRocks/UNITER.

ECCV 2020
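Below is a minimal PyTorch sketch (not the authors' released code) illustrating two ideas from the abstract: conditional masking, where only one modality is corrupted per pre-training instance while the other is fully observed, and a Word-Region Alignment loss based on Optimal Transport. The paper approximates the OT distance with an IPOT solver over Transformer-contextualized embeddings; this sketch substitutes a plain Sinkhorn iteration, and all function names, shapes, and hyperparameters are illustrative assumptions.

import torch
import torch.nn.functional as F


def conditional_mask(text_ids, region_feats, mask_token_id, p=0.15):
    """Mask tokens in ONE modality only, keeping the other fully observed."""
    text_ids = text_ids.clone()
    region_feats = region_feats.clone()
    if torch.rand(()) < 0.5:
        # Masked Language Modeling: corrupt some words, keep all regions.
        mask = torch.rand(text_ids.shape) < p
        text_ids[mask] = mask_token_id
    else:
        # Masked Region Modeling: zero out some region features, keep all words.
        mask = torch.rand(region_feats.shape[:-1]) < p
        region_feats[mask] = 0.0
    return text_ids, region_feats


def wra_ot_loss(word_emb, region_emb, n_iters=50, eps=0.1):
    """Approximate OT distance between word and region embeddings (Sinkhorn).

    word_emb:   (T, D) contextualized word embeddings
    region_emb: (R, D) contextualized region embeddings
    Returns a scalar transport cost used as the WRA loss.
    """
    # Cosine cost matrix between every word and every region.
    cost = 1.0 - F.cosine_similarity(
        word_emb.unsqueeze(1), region_emb.unsqueeze(0), dim=-1)  # (T, R)

    T_, R_ = cost.shape
    a = torch.full((T_,), 1.0 / T_)   # uniform marginal over words
    b = torch.full((R_,), 1.0 / R_)   # uniform marginal over regions
    K = torch.exp(-cost / eps)        # Gibbs kernel

    u = torch.ones_like(a)
    for _ in range(n_iters):          # Sinkhorn fixed-point updates
        v = b / (K.t() @ u)
        u = a / (K @ v)

    plan = u.unsqueeze(1) * K * v.unsqueeze(0)  # transport plan, shape (T, R)
    return (plan * cost).sum()        # approximate OT distance = WRA loss


if __name__ == "__main__":
    words = torch.randn(12, 768)      # e.g., 12 word embeddings
    regions = torch.randn(36, 768)    # e.g., 36 region embeddings
    print(wra_ot_loss(words, regions))

In the full model, the word and region embeddings would be the Transformer's contextualized outputs, and the learned transport plan could also be inspected to see which image regions each word is softly aligned to.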
Task | Dataset | Model | Metric | Value | Global Rank
Zero-Shot Cross-Modal Retrieval | Flickr30k | UNITER | Image-to-text R@1 | 80.7 | #16
Zero-Shot Cross-Modal Retrieval | Flickr30k | UNITER | Image-to-text R@5 | 95.7 | #17
Zero-Shot Cross-Modal Retrieval | Flickr30k | UNITER | Image-to-text R@10 | 98.0 | #15
Zero-Shot Cross-Modal Retrieval | Flickr30k | UNITER | Text-to-image R@1 | 66.2 | #17
Zero-Shot Cross-Modal Retrieval | Flickr30k | UNITER | Text-to-image R@5 | 88.4 | #17
Zero-Shot Cross-Modal Retrieval | Flickr30k | UNITER | Text-to-image R@10 | 92.9 | #15
Visual Reasoning | NLVR2 Test | UNITER (Large) | Accuracy | 79.5 | #10
Referring Expression Comprehension | RefCOCO | UNITER (Large) | Val | 81.41 | #14
Referring Expression Comprehension | RefCOCO | UNITER (Large) | Test A | 87.04 | #12
Referring Expression Comprehension | RefCOCO | UNITER (Large) | Test B | 74.17 | #14
Visual Entailment | SNLI-VE test | UNITER (Large) | Accuracy | 78.98 | #7
Visual Entailment | SNLI-VE val | UNITER | Accuracy | 78.98 | #8
Visual Question Answering (VQA) | VCR (Q-AR) test | UNITER (Large) | Accuracy | 62.8 | #3
Visual Question Answering (VQA) | VCR (QA-R) test | UNITER (Large, 10-model ensemble) | Accuracy | 83.4 | #3
Visual Question Answering (VQA) | VCR (QA-R) test | UNITER (Large) | Accuracy | 80.8 | #4
Visual Question Answering (VQA) | VCR (Q-A) test | UNITER (Large, 10-model ensemble) | Accuracy | 79.8 | #3
Visual Question Answering (VQA) | VCR (Q-A) test | UNITER (Large) | Accuracy | 77.3 | #5
Visual Question Answering (VQA) | VQA v2 test-dev | UNITER (Large) | Accuracy | 73.24 | #21
Visual Question Answering (VQA) | VQA v2 test-std | UNITER (Large) | Overall accuracy | 73.4 | #20
