LXMERT: Learning Cross-Modality Encoder Representations from Transformers

IJCNLP 2019 · Hao Tan, Mohit Bansal

Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework to learn these vision-and-language connections...
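
For readers who want to experiment with the pretrained model, the sketch below shows roughly how the cross-modality encoder can be queried through the Hugging Face `transformers` port of LXMERT (`LxmertTokenizer` / `LxmertModel` with the `unc-nlp/lxmert-base-uncased` checkpoint). The random tensors stand in for the 36 Faster R-CNN region features and normalized bounding boxes that the paper actually feeds to the vision stream, so the outputs are illustrative only.

```python
# Minimal sketch: querying the LXMERT cross-modality encoder via the
# Hugging Face `transformers` port. Random tensors replace the Faster R-CNN
# region features / boxes used in the paper, so outputs are illustrative only.
import torch
from transformers import LxmertModel, LxmertTokenizer

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

# Language stream: a question about the image.
inputs = tokenizer("What color is the cat on the sofa?", return_tensors="pt")

# Vision stream: 36 detected regions, each a 2048-d RoI feature plus a
# normalized (x1, y1, x2, y2) box. Real features would come from a detector.
visual_feats = torch.randn(1, 36, 2048)   # placeholder RoI features
visual_pos = torch.rand(1, 36, 4)         # placeholder normalized boxes

outputs = model(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    visual_feats=visual_feats,
    visual_pos=visual_pos,
)

# Cross-modality outputs: per-token language features, per-region vision
# features, and a pooled [CLS] vector used by answer-classification heads.
print(outputs.language_output.shape)  # (1, seq_len, 768)
print(outputs.vision_output.shape)    # (1, 36, 768)
print(outputs.pooled_output.shape)    # (1, 768)
```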


Results from the Paper


Task                      | Dataset          | Model                        | Accuracy (%) | Global Rank
Visual Question Answering | GQA test-dev     | LXMERT (Pre-train + scratch) | 60.0         | #2
Visual Question Answering | GQA test-std     | LXMERT                       | 60.3         | #2
Visual Reasoning          | NLVR2 Dev        | LXMERT (Pre-train + scratch) | 74.9         | #1
Visual Reasoning          | NLVR2 Test       | LXMERT                       | 76.2         | #2
Visual Question Answering | VizWiz           | LXMERT                       | 55.4         | #1
Visual Question Answering | VQA v2 test-dev  | LXMERT (Pre-train + scratch) | 69.9         | #10
Visual Question Answering | VQA v2 test-std  | LXMERT                       | 72.5         | #3

Methods used in the Paper