Video Retrieval

220 papers with code • 18 benchmarks • 31 datasets

The objective of video retrieval is as follows: given a text query and a pool of candidate videos, select the video which corresponds to the text query. Typically, the videos are returned as a ranked list of candidates and scored via document retrieval metrics.

Benchmarks

Add a Result

These leaderboards are used to track progress in Video Retrieval

Dataset	Best Model	Compare
MSR-VTT-1kA	HunYuan_tvr (huge)	See all
LSMDC	InternVideo2-6B	See all
MSR-VTT	VAST	See all
DiDeMo	InternVideo2-6B	See all
ActivityNet	InternVideo2-6B	See all
MSVD	InternVideo2-6B	See all
FIVR-200K	S2VS	See all
YouCook2	VAST	See all
VATEX	VAST	See all
QuerYD	QB-Norm+TT-CE+	See all
SSv2-label retrieval	UMT-L (ViT-L/16)	See all
SSv2-template retrieval	UMT-L (ViT-L/16)	See all
Condensed Movies	TESTA (ViT-B/16)	See all
TVR	Hero w/ pre-training	See all
TGIF	MDMMT-2	See all
RUDDER	PO Loss	See all
Charades-STA	PO Loss	See all
MSVD-Indonesian	X-CLIP (Cross-Lingual)	See all

Show all 18 benchmarks

Collapse benchmarks

Libraries

Use these libraries to find Video Retrieval models and implementations

towhee-io/towhee

5 papers

2,983

jpthu17/diffusionret

4 papers

albanie/collaborative-experts

3 papers

327

pytorch/fairseq

2 papers

29,225

See all 5 libraries.

Datasets

Subtasks

Replay Grounding

Composed Video Retrieval (CoVR)

Most implemented papers

Most implemented Social Latest No code

ECO: Efficient Convolutional Network for Online Video Understanding

mzolfaghari/ECO-efficient-video-understanding • • ECCV 2018

In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time.

Paper
Code

Learning a Text-Video Embedding from Incomplete and Heterogeneous Data

antoine77340/Mixture-of-Embedding-Experts • • 7 Apr 2018

We evaluate our method on the task of video retrieval and report results for the MPII Movie Description and MSR-VTT datasets.

Paper
Code

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

m-bain/frozen-in-time • • ICCV 2021

Our objective in this work is video-text retrieval - in particular a joint embedding that enables efficient text-to-video retrieval.

Paper
Code

CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

ArrowLuo/CLIP4Clip • • 18 Apr 2021

In this paper, we propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner.

Paper
Code

CoCa: Contrastive Captioners are Image-Text Foundation Models

mlfoundations/open_clip • • 4 May 2022

We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively.

Paper
Code

Dense-Captioning Events in Videos

sangminwoo/explore-and-match • • ICCV 2017

We also introduce ActivityNet Captions, a large-scale benchmark for dense-captioning events.

Paper
Code

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

antoine77340/MIL-NCE_HowTo100M • • ICCV 2019

In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations.

Paper
Code

End-to-End Learning of Visual Representations from Uncurated Instructional Videos

antoine77340/MIL-NCE_HowTo100M • • CVPR 2020

Annotating videos is cumbersome, expensive and not scalable.

Paper
Code

Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations

jpthu17/emcl • • 21 Nov 2022

Most video-and-language representation learning approaches employ contrastive learning, e. g., CLIP, to project the video and text features into a common latent space according to the semantic similarities of text-video pairs.

Paper
Code

Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?

whwu95/Cap4Video • • CVPR 2023

Most existing text-video retrieval methods focus on cross-modal matching between the visual content of videos and textual query sentences.

Paper
Code

Video Retrieval

Benchmarks Add a Result

Libraries

Datasets

Subtasks

Most implemented papers

Content

Benchmarks

Add a Result