CLIP2Video: Mastering Video-Text Retrieval via Image CLIP

21 Jun 2021  ·  Han Fang, Pengfei Xiong, Luhui Xu, Yu Chen

We present the CLIP2Video network, which transfers an image-language pre-training model to video-text retrieval in an end-to-end manner. Leading approaches in video-and-language learning try to distill spatio-temporal video features and multi-modal interactions between videos and language from a large-scale video-text dataset. In contrast, we leverage a pretrained image-language model and simplify the problem into a two-stage framework: co-learning of image-text representations, followed by enhancing temporal relations between video frames and between video and text. This makes the model trainable on comparatively small datasets. Specifically, building on the spatial semantics captured by the Contrastive Language-Image Pretraining (CLIP) model, our model uses a Temporal Difference Block to capture motion across video frames at a fine temporal scale, and a Temporal Alignment Block to re-align the tokens of video clips and phrases and enhance the multi-modal correlation. We conduct thorough ablation studies and achieve state-of-the-art performance on major text-to-video and video-to-text retrieval benchmarks, including new records of retrieval accuracy on MSR-VTT, MSVD and VATEX.
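
As a rough illustration of building video-text retrieval on per-frame image embeddings, the sketch below takes differences between adjacent frame embeddings as a crude motion cue, mean-pools the frames into one video embedding, and scores a text embedding by cosine similarity. The vectors are toy values rather than CLIP outputs, and the helper names (`adjacent_differences`, `mean_pool`, `cosine`) are hypothetical. The paper's Temporal Difference Block and Temporal Alignment Block are learned modules that this sketch does not reproduce.

```python
import math

def adjacent_differences(frames):
    """Differences between consecutive frame embeddings (a crude motion cue)."""
    return [[b - a for a, b in zip(f0, f1)] for f0, f1 in zip(frames, frames[1:])]

def mean_pool(vectors):
    """Average a list of equal-length vectors into a single vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy 3-dimensional embeddings for three frames and one caption (assumed values).
frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text = [1.0, 1.0, 1.0]

diffs = adjacent_differences(frames)  # one difference vector per adjacent frame pair
video = mean_pool(frames)             # single video-level embedding
score = cosine(video, text)           # video-text similarity used for ranking
```

Mean pooling discards frame order, which is one reason a model might add an explicit temporal signal such as the frame differences computed here.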


Results from the Paper


Ranked #11 on Video Retrieval on VATEX (using extra training data)

Task             Dataset      Model       Metric Name                Metric Value  Global Rank
Video Retrieval  MSR-VTT      CLIP2Video  text-to-video R@1          29.8          # 25
                                          text-to-video R@5          55.5          # 22
                                          text-to-video R@10         66.2          # 22
                                          text-to-video Mean Rank    45.4          # 4
                                          text-to-video Median Rank  4             # 7
                                          video-to-text R@1          54.6          # 7
                                          video-to-text R@5          82.1          # 3
                                          video-to-text R@10         90.8          # 3
                                          video-to-text Median Rank  1             # 1
                                          video-to-text Mean Rank    5.3           # 2
Video Retrieval  MSR-VTT-1kA  CLIP2Video  text-to-video Mean Rank    14.6          # 18
                                          text-to-video R@1          45.6          # 33
                                          text-to-video R@5          72.6          # 29
                                          text-to-video R@10         81.7          # 33
                                          text-to-video Median Rank  2             # 10
                                          video-to-text R@1          43.3          # 20
                                          video-to-text R@5          72.3          # 19
                                          video-to-text R@10         82.1          # 20
                                          video-to-text Median Rank  2             # 7
                                          video-to-text Mean Rank    10.2          # 17
Video Retrieval  VATEX        CLIP2Video  text-to-video R@1          57.3          # 11
                                          text-to-video R@50         95.5          # 2
                                          text-to-video R@10         90            # 9
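
The metrics reported above (R@K, Median Rank, Mean Rank) are standard retrieval measures. The sketch below shows one common way to compute them from a query-by-candidate similarity matrix, assuming that the correct candidate for query i sits at index i and that ties are resolved in favour of the correct item. The function names and the toy matrix are illustrative and are not taken from the paper or its evaluation code.

```python
from statistics import mean, median

def ranks_of_correct(sim):
    """1-based rank of the correct candidate (index i) for each query row i."""
    ranks = []
    for i, row in enumerate(sim):
        # Count candidates that score strictly higher than the correct one.
        higher = sum(1 for s in row if s > row[i])
        ranks.append(higher + 1)
    return ranks

def recall_at_k(ranks, k):
    """Percentage of queries whose correct candidate is ranked within the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)

# Toy similarity matrix: rows are text queries, columns are candidate videos.
sim = [
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.4, 0.8, 0.1],
    [0.1, 0.2, 0.7, 0.3],
    [0.6, 0.5, 0.4, 0.3],
]

ranks = ranks_of_correct(sim)
r_at_1 = recall_at_k(ranks, 1)
r_at_2 = recall_at_k(ranks, 2)
median_rank = median(ranks)
mean_rank = mean(ranks)
```

Higher is better for R@K, while lower is better for Median Rank and Mean Rank. Transposing the matrix gives the video-to-text direction under the same one-to-one assumption.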
