X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks

22 Nov 2022 · Yan Zeng, Xinsong Zhang, Hang Li, Jiawei Wang, Jipeng Zhang, Wangchunshu Zhou

Vision-language pre-training aims to learn alignments between vision and language from large amounts of data. Most existing methods learn only image-text alignments; some others rely on pre-trained object detectors to leverage vision-language alignments at the object level. In this paper, we propose to learn multi-grained vision-language alignments with a unified pre-training framework that learns multi-grained aligning and multi-grained localization simultaneously. Based on this framework, we present X$^2$-VLM, an all-in-one model with a flexible modular architecture that further unifies image-text and video-text pre-training in a single model. X$^2$-VLM is able to learn unlimited visual concepts associated with diverse text descriptions. Experimental results show that X$^2$-VLM performs best at both base and large scale on image-text and video-text tasks, striking a good trade-off between performance and model scale. Moreover, the modular design of X$^2$-VLM makes it highly transferable to any language or domain: for example, by simply replacing the text encoder with XLM-R, X$^2$-VLM outperforms state-of-the-art multilingual multi-modal pre-trained models without any multilingual pre-training. The code and pre-trained models are available at https://github.com/zengyan-97/X2-VLM.


Results from the Paper


 Ranked #1 on Cross-Modal Retrieval on Flickr30k (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Cross-Modal Retrieval | COCO 2014 | X2-VLM (large) | Image-to-text R@1 | 84.4 | #2 |
| Cross-Modal Retrieval | COCO 2014 | X2-VLM (large) | Image-to-text R@5 | 96.5 | #1 |
| Cross-Modal Retrieval | COCO 2014 | X2-VLM (large) | Image-to-text R@10 | 98.5 | #1 |
| Cross-Modal Retrieval | COCO 2014 | X2-VLM (large) | Text-to-image R@1 | 67.7 | #3 |
| Cross-Modal Retrieval | COCO 2014 | X2-VLM (large) | Text-to-image R@5 | 87.5 | #3 |
| Cross-Modal Retrieval | COCO 2014 | X2-VLM (large) | Text-to-image R@10 | 92.5 | #3 |
| Cross-Modal Retrieval | COCO 2014 | X2-VLM (base) | Image-to-text R@1 | 83.5 | #4 |
| Cross-Modal Retrieval | COCO 2014 | X2-VLM (base) | Image-to-text R@5 | 96.3 | #4 |
| Cross-Modal Retrieval | COCO 2014 | X2-VLM (base) | Image-to-text R@10 | 98.5 | #1 |
| Cross-Modal Retrieval | COCO 2014 | X2-VLM (base) | Text-to-image R@1 | 66.2 | #6 |
| Cross-Modal Retrieval | COCO 2014 | X2-VLM (base) | Text-to-image R@5 | 87.1 | #6 |
| Cross-Modal Retrieval | COCO 2014 | X2-VLM (base) | Text-to-image R@10 | 92.2 | #5 |
| Cross-Modal Retrieval | Flickr30k | X2-VLM (base) | Image-to-text R@1 | 98.5 | #2 |
| Cross-Modal Retrieval | Flickr30k | X2-VLM (base) | Image-to-text R@5 | 100 | #1 |
| Cross-Modal Retrieval | Flickr30k | X2-VLM (base) | Image-to-text R@10 | 100 | #1 |
| Cross-Modal Retrieval | Flickr30k | X2-VLM (base) | Text-to-image R@1 | 90.4 | #4 |
| Cross-Modal Retrieval | Flickr30k | X2-VLM (base) | Text-to-image R@5 | 98.2 | #5 |
| Cross-Modal Retrieval | Flickr30k | X2-VLM (base) | Text-to-image R@10 | 99.3 | #5 |
| Cross-Modal Retrieval | Flickr30k | X2-VLM (large) | Image-to-text R@1 | 98.8 | #1 |
| Cross-Modal Retrieval | Flickr30k | X2-VLM (large) | Image-to-text R@5 | 100 | #1 |
| Cross-Modal Retrieval | Flickr30k | X2-VLM (large) | Image-to-text R@10 | 100 | #1 |
| Cross-Modal Retrieval | Flickr30k | X2-VLM (large) | Text-to-image R@1 | 91.8 | #2 |
| Cross-Modal Retrieval | Flickr30k | X2-VLM (large) | Text-to-image R@5 | 98.6 | #3 |
| Cross-Modal Retrieval | Flickr30k | X2-VLM (large) | Text-to-image R@10 | 99.5 | #2 |
| Video Retrieval | MSR-VTT-1kA | X2-VLM (base) | Text-to-video R@1 | 47.6 | #26 |
| Video Retrieval | MSR-VTT-1kA | X2-VLM (base) | Text-to-video R@5 | 74.1 | #22 |
| Video Retrieval | MSR-VTT-1kA | X2-VLM (base) | Text-to-video R@10 | 84.2 | #16 |
| Video Retrieval | MSR-VTT-1kA | X2-VLM (large) | Text-to-video R@1 | 49.6 | #17 |
| Video Retrieval | MSR-VTT-1kA | X2-VLM (large) | Text-to-video R@5 | 76.7 | #12 |
| Video Retrieval | MSR-VTT-1kA | X2-VLM (large) | Text-to-video R@10 | 84.2 | #16 |
| Visual Question Answering (VQA) | MSRVTT-QA | X2-VLM (base) | Accuracy | 0.45 | #16 |
| Visual Question Answering (VQA) | MSRVTT-QA | X2-VLM (large) | Accuracy | 0.455 | #15 |
| Visual Question Answering (VQA) | MSVD-QA | X2-VLM (base) | Accuracy | 0.528 | #18 |
| Visual Question Answering (VQA) | MSVD-QA | X2-VLM (large) | Accuracy | 0.546 | #17 |
| Visual Reasoning | NLVR2 Dev | X2-VLM (base) | Accuracy | 86.2 | #4 |
| Visual Reasoning | NLVR2 Dev | X2-VLM (large) | Accuracy | 88.7 | #2 |
| Visual Reasoning | NLVR2 Test | X2-VLM (base) | Accuracy | 87.0 | #4 |
| Visual Reasoning | NLVR2 Test | X2-VLM (large) | Accuracy | 89.4 | #2 |
| Visual Grounding | RefCOCO+ testA | X2-VLM (base) | Accuracy (%) | 90.3 | #4 |
| Visual Grounding | RefCOCO+ testA | X2-VLM (large) | Accuracy (%) | 92.1 | #2 |
| Visual Grounding | RefCOCO+ testB | X2-VLM (base) | Accuracy (%) | 78.4 | #4 |
| Visual Grounding | RefCOCO+ testB | X2-VLM (large) | Accuracy (%) | 81.8 | #2 |
| Visual Grounding | RefCOCO+ val | X2-VLM (base) | Accuracy (%) | 85.2 | #4 |
| Visual Grounding | RefCOCO+ val | X2-VLM (large) | Accuracy (%) | 87.6 | #2 |
| Visual Question Answering (VQA) | VQA v2 test-dev | X2-VLM (base) | Accuracy | 80.4 | #10 |
| Visual Question Answering (VQA) | VQA v2 test-dev | X2-VLM (large) | Accuracy | 81.9 | #6 |
| Visual Question Answering (VQA) | VQA v2 test-std | X2-VLM (base) | Overall | 80.2 | #8 |
| Visual Question Answering (VQA) | VQA v2 test-std | X2-VLM (large) | Overall | 81.8 | #4 |
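The R@K (recall-at-K) numbers reported for the retrieval tasks above measure the fraction of queries whose ground-truth match appears among the top-K retrieved candidates. A minimal sketch of how such a metric is computed from a query-candidate similarity matrix (synthetic scores for illustration, not the model's actual outputs):

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of queries whose ground-truth candidate (assumed to be
    candidate i for query i) appears among the top-k retrieved items."""
    # Rank candidates for each query by descending similarity, keep top k.
    topk = np.argsort(-sim, axis=1)[:, :k]
    # A query is a hit if its own index appears in its top-k list.
    hits = (topk == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return hits.mean()

# Toy similarity matrix: 4 queries x 4 candidates, correct match on the diagonal.
sim = np.array([
    [0.9, 0.2, 0.1, 0.0],  # correct candidate ranked 1st
    [0.3, 0.8, 0.7, 0.1],  # correct candidate ranked 1st
    [0.6, 0.5, 0.4, 0.2],  # correct candidate ranked 3rd
    [0.1, 0.2, 0.3, 0.9],  # correct candidate ranked 1st
])
print(recall_at_k(sim, 1))  # 0.75
print(recall_at_k(sim, 5))  # 1.0
```

Image-to-text and text-to-image directions simply transpose the similarity matrix; when a COCO/Flickr image has multiple reference captions, a hit on any of them counts.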

Methods


BASE • XLM-R