COSA: Concatenated Sample Pretrained Vision-Language Foundation Model

15 Jun 2023 · Sihan Chen, Xingjian He, Handong Li, Xiaojie Jin, Jiashi Feng, Jing Liu

Due to the limited scale and quality of video-text training corpora, most vision-language foundation models employ image-text datasets for pretraining and primarily focus on modeling visually semantic representations while disregarding temporal semantic representations and correlations. To address this issue, we propose COSA, a COncatenated SAmple pretrained vision-language foundation model. COSA jointly models visual contents and event-level temporal cues using only image-text corpora. We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining. This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus, enabling richer scene transformations and explicit event-description correspondence. Extensive experiments demonstrate that COSA consistently improves performance across a broad range of downstream tasks, including long-form/short-form video-text tasks and image-text tasks such as retrieval, captioning, and question answering. Notably, COSA achieves state-of-the-art results on various competitive benchmarks. Code and model are released at https://github.com/TXH-mercury/COSA.
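
The concatenation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `concatenate_samples`, the `PseudoVideoSample` container, the random grouping, and the default group size of 3 are all assumptions made for illustration. The only idea taken from the abstract is that several image-text pairs are joined in order, so that the images act as frames of a pseudo video and the captions form a paragraph whose sentences line up with those frames.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class PseudoVideoSample:
    """Hypothetical container for one concatenated sample."""
    frames: List[str]  # image identifiers, treated as frames of a pseudo video
    paragraph: str     # captions joined in the same order as the frames


def concatenate_samples(
    pairs: Sequence[Tuple[str, str]],
    group_size: int = 3,  # assumed value, not taken from the paper
    seed: int = 0,
) -> List[PseudoVideoSample]:
    """Group (image, caption) pairs into pseudo video-paragraph samples."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)  # assumption: groups are formed from random pairs

    samples = []
    # Leftover pairs that cannot fill a whole group are dropped.
    for start in range(0, len(shuffled) - group_size + 1, group_size):
        group = shuffled[start:start + group_size]
        frames = [image for image, _ in group]
        # Sentence i of the paragraph describes frame i, which keeps the
        # event-description correspondence explicit.
        paragraph = " ".join(caption for _, caption in group)
        samples.append(PseudoVideoSample(frames, paragraph))
    return samples
```

Called with seven pairs and `group_size=3`, the function returns two samples of three frames each and discards the remaining pair. In a real pipeline the image identifiers would be replaced by image tensors stacked along a time axis.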

Results from the Paper


 Ranked #1 on TGIF-Frame on TGIF-QA (using extra training data)

Task                            | Dataset        | Model | Metric Name       | Metric Value | Global Rank
Video Retrieval                 | ActivityNet    | COSA  | text-to-video R@1 | 67.3         | #4
Video Question Answering        | ActivityNet-QA | COSA  | Accuracy          | 49.9         | #6
Video Retrieval                 | DiDeMo         | COSA  | text-to-video R@1 | 70.5         | #4
Video Retrieval                 | LSMDC          | COSA  | text-to-video R@1 | 39.4         | #5
Video Retrieval                 | MSR-VTT        | COSA  | text-to-video R@1 | 57.9         | #6
Video Captioning                | MSR-VTT        | COSA  | CIDEr             | 74.7         | #5
Video Captioning                | MSR-VTT        | COSA  | BLEU-4            | 53.7         | #7
Video Question Answering        | MSRVTT-QA      | COSA  | Accuracy          | 49.2         | #3
Video Captioning                | MSVD           | COSA  | CIDEr             | 178.5        | #3
Video Captioning                | MSVD           | COSA  | BLEU-4            | 76.5         | #3
Visual Question Answering (VQA) | MSVD-QA        | COSA  | Accuracy          | 0.60         | #4
TGIF-Frame                      | TGIF-QA        | COSA  | Accuracy          | 79.5         | #1
Video Captioning                | TVC            | COSA  | BLEU-4            | 18.8         | #2
Video Captioning                | TVC            | COSA  | CIDEr             | 70.7         | #2
Video Captioning                | VATEX          | COSA  | BLEU-4            | 43.7         | #3
Video Captioning                | VATEX          | COSA  | CIDEr             | 96.5         | #2
Video Captioning                | YouCook2       | COSA  | BLEU-4            | 10.1         | #9
Video Captioning                | YouCook2       | COSA  | CIDEr             | 1.31         | #6

Methods