UNITER: UNiversal Image-TExt Representation Learning

ECCV 2020 · Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu

Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodal inputs are processed simultaneously for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings.
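To make the joint-embedding idea concrete, below is a minimal sketch of a single-stream encoder in the spirit of UNITER: text tokens and detected image-region features are projected into a shared space and processed together by one Transformer, so self-attention spans both modalities. All class names, dimensions, and the 7-d region-location encoding are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class JointImageTextEncoder(nn.Module):
    """Sketch of a UNITER-style single-stream image-text Transformer (hypothetical config)."""
    def __init__(self, vocab_size=30522, region_feat_dim=2048,
                 hidden=768, layers=12, heads=12, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_len, hidden)
        # Image regions: visual feature plus a 7-d location vector
        # (assumed: normalized bbox coords, width, height, area), each projected to hidden size.
        self.img_proj = nn.Linear(region_feat_dim, hidden)
        self.loc_proj = nn.Linear(7, hidden)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, token_ids, region_feats, region_locs):
        # token_ids: (B, T); region_feats: (B, R, 2048); region_locs: (B, R, 7)
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        txt = self.tok_emb(token_ids) + self.pos_emb(pos)
        img = self.img_proj(region_feats) + self.loc_proj(region_locs)
        # Concatenate both modalities into one sequence for joint self-attention.
        return self.encoder(torch.cat([txt, img], dim=1))

# Usage: joint embeddings for 20 text tokens and 36 detected regions.
model = JointImageTextEncoder()
out = model(torch.randint(0, 30522, (2, 20)),
            torch.randn(2, 36, 2048), torch.rand(2, 36, 7))
print(out.shape)  # torch.Size([2, 56, 768])
```

The output is one contextualized embedding per token and per region; a downstream V+L task head (e.g., a classifier for VQA or NLVR2) can then be attached on top of these joint representations.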


Results from the Paper


TASK                        DATASET            MODEL           METRIC    VALUE   GLOBAL RANK
Visual Reasoning            NLVR2 (test)       UNITER (Large)  Accuracy  79.5    #1
Visual Entailment           SNLI-VE (test)     UNITER (Large)  Accuracy  78.98   #1
Visual Question Answering   VCR Q-AR (test)    UNITER (Large)  Accuracy  62.8    #1
Visual Question Answering   VCR QA-R (test)    UNITER (Large)  Accuracy  80.8    #1
Visual Question Answering   VCR Q-A (test)     UNITER (Large)  Accuracy  77.3    #1
Visual Question Answering   VQA v2 (test-dev)  UNITER (Large)  Accuracy  73.24   #2
Visual Question Answering   VQA v2 (test-std)  UNITER (Large)  Accuracy  73.4    #1
