A Straightforward Framework For Video Retrieval Using CLIP

Video retrieval is a challenging task in which a text query is matched to a video, or vice versa. Most existing approaches to this problem rely on user-provided annotations. Although simple, this strategy is not always feasible in practice. In this work, we explore the application of the language-image model CLIP to obtain video representations without the need for such annotations. CLIP was explicitly trained to learn a common embedding space in which images and text can be compared. Using the techniques described in this document, we extend its application to videos, obtaining state-of-the-art results on the MSR-VTT and MSVD benchmarks.
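The excerpt does not include the paper's code, but the core retrieval idea it describes (embed video frames and text into CLIP's shared space, aggregate frame embeddings into a video vector, and rank by cosine similarity) can be sketched in a few lines. The sketch below is an illustrative assumption, not the paper's exact aggregation technique: it assumes per-frame CLIP image embeddings and a CLIP text embedding have already been computed, and stands in random vectors for them.

```python
import numpy as np

def video_embedding(frame_embs):
    # Mean-pool per-frame CLIP image embeddings into a single video vector
    # (an assumed aggregation, one of several possible choices), then
    # L2-normalize so that dot products equal cosine similarities.
    v = frame_embs.mean(axis=0)
    return v / np.linalg.norm(v)

def rank_videos(text_emb, video_embs):
    # Cosine similarity between the normalized text query embedding and
    # each video embedding; return video indices from best to worst match.
    sims = video_embs @ text_emb
    return np.argsort(-sims)

# Toy example: random vectors stand in for real CLIP embeddings (dim 512).
rng = np.random.default_rng(0)
videos = np.stack([video_embedding(rng.normal(size=(8, 512)))
                   for _ in range(5)])
# A query embedding deliberately close to video 3.
query = videos[3] + 0.01 * rng.normal(size=512)
query /= np.linalg.norm(query)
ranking = rank_videos(query, videos)
print(ranking[0])
```

The same ranking works in the video-to-text direction by swapping the roles of the query and the gallery, which is why the tables below report both text-to-video and video-to-text metrics.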


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Video Retrieval | LSMDC | CLIP | text-to-video R@1 | 11.3 | # 31 |
| Video Retrieval | LSMDC | CLIP | text-to-video R@5 | 22.7 | # 29 |
| Video Retrieval | LSMDC | CLIP | text-to-video R@10 | 29.2 | # 29 |
| Video Retrieval | LSMDC | CLIP | text-to-video Median Rank | 56.5 | # 22 |
| Video Retrieval | LSMDC | CLIP | video-to-text R@1 | 6.8 | # 14 |
| Video Retrieval | LSMDC | CLIP | video-to-text R@5 | 16.4 | # 11 |
| Video Retrieval | LSMDC | CLIP | video-to-text R@10 | 22.1 | # 10 |
| Video Retrieval | LSMDC | CLIP | video-to-text Median Rank | 73 | # 6 |
| Video Retrieval | MSR-VTT | CLIP | text-to-video R@1 | 21.4 | # 31 |
| Video Retrieval | MSR-VTT | CLIP | text-to-video R@5 | 41.1 | # 29 |
| Video Retrieval | MSR-VTT | CLIP | text-to-video R@10 | 50.4 | # 29 |
| Video Retrieval | MSR-VTT | CLIP | text-to-video Median Rank | 10 | # 13 |
| Video Retrieval | MSR-VTT | CLIP | video-to-text R@1 | 40.3 | # 9 |
| Video Retrieval | MSR-VTT | CLIP | video-to-text R@5 | 69.7 | # 7 |
| Video Retrieval | MSR-VTT | CLIP | video-to-text R@10 | 79.2 | # 7 |
| Video Retrieval | MSR-VTT | CLIP | video-to-text Median Rank | 2 | # 3 |
| Video Retrieval | MSR-VTT-1kA | CLIP | text-to-video R@1 | 31.2 | # 43 |
| Video Retrieval | MSR-VTT-1kA | CLIP | text-to-video R@5 | 53.7 | # 49 |
| Video Retrieval | MSR-VTT-1kA | CLIP | text-to-video R@10 | 64.2 | # 52 |
| Video Retrieval | MSR-VTT-1kA | CLIP | text-to-video Median Rank | 4 | # 28 |
| Video Retrieval | MSR-VTT-1kA | CLIP | video-to-text R@1 | 27.2 | # 23 |
| Video Retrieval | MSR-VTT-1kA | CLIP | video-to-text R@5 | 51.7 | # 21 |
| Video Retrieval | MSR-VTT-1kA | CLIP | video-to-text R@10 | 62.6 | # 22 |
| Video Retrieval | MSR-VTT-1kA | CLIP | video-to-text Median Rank | 5 | # 18 |
| Video Retrieval | MSVD | CLIP | text-to-video R@1 | 37 | # 21 |
| Video Retrieval | MSVD | CLIP | text-to-video R@5 | 64.1 | # 20 |
| Video Retrieval | MSVD | CLIP | text-to-video R@10 | 73.8 | # 19 |
| Video Retrieval | MSVD | CLIP | text-to-video Median Rank | 3.0 | # 14 |
| Video Retrieval | MSVD | CLIP | video-to-text R@1 | 59.9 | # 15 |
| Video Retrieval | MSVD | CLIP | video-to-text R@5 | 85.2 | # 12 |
| Video Retrieval | MSVD | CLIP | video-to-text R@10 | 90.7 | # 12 |
| Video Retrieval | MSVD | CLIP | video-to-text Median Rank | 1 | # 1 |

Methods


No methods listed for this paper.