VisualBERT: A Simple and Performant Baseline for Vision and Language

9 Aug 2019 · Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang

We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks. VisualBERT consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an associated input image with self-attention...
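The core idea described above is that text tokens and image-region features are concatenated into a single sequence, so one self-attention stack can relate words to regions without explicit alignment supervision. The sketch below illustrates this input construction with a single-head self-attention layer in NumPy; all dimensions, weights, and names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
d = 16          # shared hidden size
n_text = 4      # number of text tokens
n_regions = 3   # number of detected image regions

# Text token embeddings and visual region features (e.g. from an object
# detector), both assumed already projected into the same d-dim space.
text_emb = rng.normal(size=(n_text, d))
region_feat = rng.normal(size=(n_regions, d))

# VisualBERT-style input: the two sequences are simply concatenated,
# so one self-attention stack can mix words and regions freely.
x = np.concatenate([text_emb, region_feat], axis=0)  # (n_text + n_regions, d)

def self_attention(x, w_q, w_k, w_v):
    """One single-head scaled dot-product self-attention layer."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(x.shape[-1])
    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out, attn = self_attention(x, w_q, w_k, w_v)

# Every position (word or region) attends over all positions; the
# text-to-region block of the attention matrix is where implicit
# grounding between words and image regions can emerge.
text_to_region = attn[:n_text, n_text:]
print(out.shape)             # (7, 16)
print(text_to_region.shape)  # (4, 3)
```

In the full model this layer would be stacked, multi-headed, and interleaved with feed-forward blocks as in BERT, but the key property is already visible here: attention weights between text and region positions are learned jointly rather than supervised directly.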

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Phrase Grounding | Flickr30k Entities Dev | VisualBERT | R@1 | 70.4 | #1 |
| Phrase Grounding | Flickr30k Entities Dev | VisualBERT | R@5 | 84.49 | #1 |
| Phrase Grounding | Flickr30k Entities Dev | VisualBERT | R@10 | 86.31 | #1 |
| Phrase Grounding | Flickr30k Entities Test | VisualBERT | R@1 | 71.33 | #1 |
| Phrase Grounding | Flickr30k Entities Test | VisualBERT | R@5 | 84.98 | #1 |
| Phrase Grounding | Flickr30k Entities Test | VisualBERT | R@10 | 86.51 | #1 |
| Visual Reasoning | NLVR | VisualBERT | Accuracy (Dev) | 67.4% | #1 |
| Visual Reasoning | NLVR | VisualBERT | Accuracy (Test-P) | 67.0% | #1 |
| Visual Reasoning | NLVR | VisualBERT | Accuracy (Test-U) | 67.3% | #1 |
| Visual Reasoning | NLVR2 Dev | VisualBERT | Accuracy | 66.7 | #2 |
| Visual Question Answering | VCR (Q-A) dev | VisualBERT | Accuracy | 70.8 | #3 |
| Visual Question Answering | VCR (QA-R) dev | VisualBERT | Accuracy | 73.2 | #3 |
| Visual Question Answering | VCR (Q-AR) dev | VisualBERT | Accuracy | 52.2 | #3 |
| Visual Question Answering | VCR (Q-A) test | VisualBERT | Accuracy | 71.6 | #3 |
| Visual Question Answering | VCR (QA-R) test | VisualBERT | Accuracy | 73.2 | #3 |
| Visual Question Answering | VCR (Q-AR) test | VisualBERT | Accuracy | 52.4 | #3 |
| Visual Question Answering | VQA v2 test-dev | VisualBERT | Accuracy | 70.8 | #7 |
| Visual Question Answering | VQA v2 test-std | VisualBERT | Accuracy | 71.0 | #6 |

Methods used in the Paper