VinVL: Revisiting Visual Representations in Vision-Language Models

This paper presents a detailed study of improving visual representations for vision-language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model (Anderson et al., 2018), the new model is bigger, better designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. It can therefore generate representations of a richer collection of visual objects and concepts. While previous VL research has focused mainly on improving the vision-language fusion model, leaving the object detection model untouched, we show that visual features matter significantly in VL models. In our experiments, we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model, OSCAR (Li et al., 2020), use an improved approach, OSCAR+, to pre-train the VL model, and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve performance across all VL tasks, setting new state-of-the-art results on seven public benchmarks. We will release the new object detection model to the public.

CVPR 2021
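To make the pipeline concrete, below is a minimal, hypothetical PyTorch sketch of the OSCAR-style input format that VinVL builds on: caption tokens, detected object tags, and detector region features (2048-d visual features plus 6-d box geometry, i.e. 2054-d, as in VinVL) are embedded into a shared space and processed by one Transformer encoder. All class and parameter names here (VLFusionSketch, region_dim, the layer/head counts) are illustrative assumptions, not the released implementation; positional/segment embeddings and the pre-training heads are omitted for brevity.

```python
import torch
import torch.nn as nn

class VLFusionSketch(nn.Module):
    """Illustrative sketch (not the official code): text tokens, object
    tags, and detector region features share one Transformer encoder."""

    def __init__(self, vocab_size=30522, hidden=768, region_dim=2054,
                 layers=2, heads=12):
        super().__init__()
        # Object tags are words, so they share the text embedding table.
        self.token_emb = nn.Embedding(vocab_size, hidden)
        # Project detector features (2048-d appearance + 6-d box) to hidden size.
        self.region_proj = nn.Linear(region_dim, hidden)
        enc_layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)

    def forward(self, text_ids, tag_ids, region_feats):
        # text_ids: (B, Lt), tag_ids: (B, Lq), region_feats: (B, K, region_dim)
        text = self.token_emb(text_ids)
        tags = self.token_emb(tag_ids)
        regions = self.region_proj(region_feats)
        # Concatenate the three segments into one sequence; the encoder
        # fuses language and vision through shared self-attention.
        x = torch.cat([text, tags, regions], dim=1)
        return self.encoder(x)  # contextualized features for VL task heads

# Toy usage: 20 caption tokens, 5 tag tokens, 30 detected regions.
model = VLFusionSketch()
out = model(torch.randint(0, 30522, (1, 20)),
            torch.randint(0, 30522, (1, 5)),
            torch.randn(1, 30, 2054))
print(out.shape)  # torch.Size([1, 55, 768])
```

The point of the paper is orthogonal to this fusion architecture: VinVL swaps in a stronger detector, so the `region_feats` tensor carries richer object-centric representations while the fusion model's interface stays the same.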
Results

Image Captioning · COCO Captions · VinVL
    BLEU-4: 41.0 (rank #15)
    METEOR: 31.1 (rank #10)
    CIDEr: 140.9 (rank #16)
    SPICE: 25.2 (rank #8)
Image-Text Matching · CommercialAdsDataset · VinVL
    ADD(S) AUC: 88.56 (rank #2)
Visual Question Answering (VQA) · GQA Test2019 · Single Model
    Accuracy: 64.65 (rank #11)
    Binary: 82.63 (rank #3)
    Open: 48.77 (rank #14)
    Consistency: 94.35 (rank #4)
    Plausibility: 84.98 (rank #25)
    Validity: 96.62 (rank #7)
    Distribution: 4.72 (rank #114)
Image Captioning · nocaps entire · VinVL (Microsoft Cognitive Services + MSR)
    CIDEr: 92.46 (rank #13)
    BLEU-1: 81.59 (rank #11)
    BLEU-2: 65.15 (rank #11)
    BLEU-3: 45.04 (rank #13)
    BLEU-4: 26.15 (rank #13)
    ROUGE-L: 56.96 (rank #12)
    METEOR: 27.57 (rank #13)
    SPICE: 13.07 (rank #13)

Image Captioning · nocaps in-domain · VinVL (Microsoft Cognitive Services + MSR)
    CIDEr: 97.99 (rank #15)
    BLEU-1: 83.24 (rank #11)
    BLEU-2: 68.04 (rank #12)
    BLEU-3: 49.68 (rank #14)
    BLEU-4: 30.62 (rank #14)
    ROUGE-L: 58.54 (rank #14)
    METEOR: 29.51 (rank #13)
    SPICE: 13.63 (rank #14)

Image Captioning · nocaps near-domain · VinVL (Microsoft Cognitive Services + MSR)
    CIDEr: 95.16 (rank #13)
    BLEU-1: 82.77 (rank #11)
    BLEU-2: 66.94 (rank #11)
    BLEU-3: 47.02 (rank #13)
    BLEU-4: 27.97 (rank #13)
    ROUGE-L: 57.95 (rank #13)
    METEOR: 28.24 (rank #14)
    SPICE: 13.36 (rank #15)

Image Captioning · nocaps out-of-domain · VinVL (Microsoft Cognitive Services + MSR)
    CIDEr: 78.01 (rank #15)
    BLEU-1: 75.78 (rank #14)
    BLEU-2: 56.1 (rank #16)
    BLEU-3: 34.02 (rank #16)
    BLEU-4: 15.86 (rank #17)
    ROUGE-L: 51.99 (rank #14)
    METEOR: 23.55 (rank #17)
    SPICE: 11.48 (rank #15)
Image Captioning · nocaps-val-in-domain · VinVL
    CIDEr: 103.1 (rank #10)
    SPICE: 14.2 (rank #9)
    Pre-train (#images): 5.7M (rank #5)

Image Captioning · nocaps-val-near-domain · VinVL
    CIDEr: 96.1 (rank #9)
    SPICE: 13.8 (rank #8)
    Pre-train (#images): 5.7M (rank #6)

Image Captioning · nocaps-val-out-domain · VinVL
    CIDEr: 88.3 (rank #10)
    SPICE: 12.1 (rank #8)
    Pre-train (#images): 5.7M (rank #6)

Image Captioning · nocaps-val-overall · VinVL
    CIDEr: 95.5 (rank #9)
    SPICE: 13.5 (rank #8)
    Pre-train (#images): 5.7M (rank #6)
Visual Question Answering (VQA) · VQA v2 test-std · MSR + MS Cog. Svcs.
    overall: 76.63 (rank #14)
    yes/no: 92.04 (rank #6)
    number: 61.5 (rank #5)
    other: 66.68 (rank #6)

Visual Question Answering (VQA) · VQA v2 test-std · MSR + MS Cog. Svcs., X10 models
    overall: 77.45 (rank #13)
    yes/no: 92.38 (rank #5)
    number: 62.55 (rank #4)
    other: 67.87 (rank #5)
