TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Retrieval	MSR-VTT	CLIP2Video	text-to-video R@1	29.8	# 25
Video Retrieval	MSR-VTT	CLIP2Video	text-to-video R@5	55.5	# 22
Video Retrieval	MSR-VTT	CLIP2Video	text-to-video R@10	66.2	# 22
Video Retrieval	MSR-VTT	CLIP2Video	text-to-video Mean Rank	45.4	# 4
Video Retrieval	MSR-VTT	CLIP2Video	text-to-video Median Rank	4	# 7
Video Retrieval	MSR-VTT	CLIP2Video	video-to-text R@1	54.6	# 7
Video Retrieval	MSR-VTT	CLIP2Video	video-to-text R@5	82.1	# 3
Video Retrieval	MSR-VTT	CLIP2Video	video-to-text R@10	90.8	# 3
Video Retrieval	MSR-VTT	CLIP2Video	video-to-text Median Rank	1	# 1
Video Retrieval	MSR-VTT	CLIP2Video	video-to-text Mean Rank	5.3	# 2
Video Retrieval	MSR-VTT-1kA	CLIP2Video	text-to-video Mean Rank	14.6	# 18
Video Retrieval	MSR-VTT-1kA	CLIP2Video	text-to-video R@1	45.6	# 33
Video Retrieval	MSR-VTT-1kA	CLIP2Video	text-to-video R@5	72.6	# 29
Video Retrieval	MSR-VTT-1kA	CLIP2Video	text-to-video R@10	81.7	# 33
Video Retrieval	MSR-VTT-1kA	CLIP2Video	text-to-video Median Rank	2	# 10
Video Retrieval	MSR-VTT-1kA	CLIP2Video	video-to-text R@1	43.3	# 20
Video Retrieval	MSR-VTT-1kA	CLIP2Video	video-to-text R@5	72.3	# 19
Video Retrieval	MSR-VTT-1kA	CLIP2Video	video-to-text R@10	82.1	# 20
Video Retrieval	MSR-VTT-1kA	CLIP2Video	video-to-text Median Rank	2	# 7
Video Retrieval	MSR-VTT-1kA	CLIP2Video	video-to-text Mean Rank	10.2	# 17
Video Retrieval	VATEX	CLIP2Video	text-to-video R@1	57.3	# 11
Video Retrieval	VATEX	CLIP2Video	text-to-video R@50	95.5	# 2
Video Retrieval	VATEX	CLIP2Video	text-to-video R@10	90	# 9
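
The metric names in the table (R@K, Median Rank, Mean Rank) can be illustrated with a minimal sketch. It assumes a square similarity matrix in which the ground-truth match for query `i` is candidate `i`; the function name `retrieval_metrics` and the toy scores are hypothetical and are not taken from the CLIP2Video code. Real benchmark protocols may differ, for example when a video has several reference captions. Higher is better for R@K, lower is better for the rank metrics.

```python
from statistics import mean, median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K, median rank and mean rank from a similarity matrix.

    Assumption: sim[i][j] is the score of query i against candidate j,
    and the correct candidate for query i is candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank of the ground truth = 1 + number of candidates scored strictly higher.
        ranks.append(1 + sum(1 for score in row if score > row[i]))

    # R@K: percentage of queries whose ground truth appears in the top K.
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    metrics["Median Rank"] = median(ranks)
    metrics["Mean Rank"] = mean(ranks)
    return metrics


# Toy example with three queries and three candidates (made-up scores).
sim = [
    [0.9, 0.1, 0.3],  # ground truth ranked 1st
    [0.2, 0.4, 0.8],  # ground truth ranked 2nd
    [0.5, 0.6, 0.7],  # ground truth ranked 1st
]
print(retrieval_metrics(sim, ks=(1, 2)))
```

In this toy case two of three queries rank their ground truth first, so R@1 is about 66.7, while R@2 is 100.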

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/clip2video-mastering-video-text-retrieval-via/video-retrieval-on-vatex)](https://paperswithcode.com/sota/video-retrieval-on-vatex?p=clip2video-mastering-video-text-retrieval-via)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/clip2video-mastering-video-text-retrieval-via/video-retrieval-on-msr-vtt)](https://paperswithcode.com/sota/video-retrieval-on-msr-vtt?p=clip2video-mastering-video-text-retrieval-via)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/clip2video-mastering-video-text-retrieval-via/video-retrieval-on-msr-vtt-1ka)](https://paperswithcode.com/sota/video-retrieval-on-msr-vtt-1ka?p=clip2video-mastering-video-text-retrieval-via)`

CLIP2Video: Mastering Video-Text Retrieval via Image CLIP

21 Jun 2021 · Han Fang, Pengfei Xiong, Luhui Xu, Yu Chen

We present the CLIP2Video network, which transfers an image-language pre-training model to video-text retrieval in an end-to-end manner. Leading approaches in video-and-language learning try to distill spatio-temporal video features and multi-modal interactions between videos and language from a large-scale video-text dataset. In contrast, we leverage a pretrained image-language model and simplify it into a two-stage framework, with co-learning of image-text and enhancement of temporal relations between video frames and between video and text, respectively, which makes it trainable on comparatively small datasets. Specifically, based on the spatial semantics captured by the Contrastive Language-Image Pretraining (CLIP) model, our model involves a Temporal Difference Block to capture motion at fine temporal video frames, and a Temporal Alignment Block to re-align the tokens of video clips and phrases and enhance the multi-modal correlation. We conduct thorough ablation studies and achieve state-of-the-art performance on major text-to-video and video-to-text retrieval benchmarks, including new records of retrieval accuracy on MSR-VTT, MSVD and VATEX.


Code


CryhanFang/CLIP2Video official

220

Tasks


Language Modelling

Retrieval

Text Retrieval

Video Retrieval

Video-Text Retrieval

Video to Text Retrieval

Datasets

MSR-VTT

MSVD

VATEX

Results from the Paper


Ranked #11 on Video Retrieval on VATEX (using extra training data)


