Video Retrieval
221 papers with code • 18 benchmarks • 31 datasets
The objective of video retrieval is as follows: given a text query and a pool of candidate videos, select the video which corresponds to the text query. Typically, the videos are returned as a ranked list of candidates and scored via document retrieval metrics.
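The setup above can be sketched in a few lines: embed the query and the candidate videos into a shared space (the embedding models themselves are assumed here), rank candidates by cosine similarity, and score the ranked list with a standard retrieval metric such as Recall@K. This is an illustrative sketch, not any particular paper's method.

```python
import numpy as np

def rank_videos(query_emb, video_embs):
    """Rank candidate videos by cosine similarity to a text query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    scores = v @ q
    return np.argsort(-scores)  # candidate indices, best match first

def recall_at_k(ranking, correct_idx, k):
    """1.0 if the ground-truth video appears in the top-k results, else 0.0."""
    return float(correct_idx in ranking[:k])

# Toy example with hand-picked 3-dim embeddings (real systems would use
# learned text/video encoders); video 0 is the intended match.
videos = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 1.0, 0.0]])
query = np.array([0.9, 0.1, 0.0])
ranking = rank_videos(query, videos)
print(ranking, recall_at_k(ranking, 0, 1))  # video 0 ranked first
```

Averaging Recall@K over a set of queries (typically K = 1, 5, 10) gives the headline numbers reported on the benchmarks listed here.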
Libraries
Use these libraries to find Video Retrieval models and implementations.
Most implemented papers
Hashing with Mutual Information
Binary vector embeddings enable fast nearest neighbor retrieval in large databases of high-dimensional objects, and play an important role in many practical applications, such as image and video retrieval.
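The speed advantage of binary codes comes from replacing float similarity with Hamming distance, which is a cheap bit-count. A minimal sketch, assuming sign-thresholding of real-valued embeddings (this illustrates hashing-based retrieval in general, not the paper's mutual-information objective):

```python
import numpy as np

def binarize(embeddings):
    """Sign-threshold real-valued embeddings into 0/1 binary codes."""
    return (np.asarray(embeddings) > 0).astype(np.uint8)

def hamming_nn(query_code, db_codes):
    """Return database indices sorted by Hamming distance to the query code."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists, kind="stable")

# Toy database of three 4-bit codes derived from made-up embeddings.
db = binarize([[ 0.3, -1.2,  0.5,  0.7],
               [-0.4,  0.9, -0.1,  0.2],
               [ 0.6, -0.8,  0.4,  0.9]])
q = binarize([0.2, -0.5, 0.3, 0.6])
order = hamming_nn(q, db)
print(order)  # items 0 and 2 share the query's code, item 1 is farthest
```

In production systems the XOR-and-popcount is done on packed 64-bit words, making each comparison a handful of CPU instructions.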
Person Search in Videos with One Portrait Through Visual and Temporal Links
In real-world applications, e.g. law enforcement and video retrieval, one often needs to search for a particular person in long videos given just one portrait.
A Joint Sequence Fusion Model for Video Question Answering and Retrieval
We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pair of multimodal sequences (e.g. a video clip and a language sentence).
TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval
The queries are also labeled with query types that indicate whether each is more related to the video, the subtitle, or both, allowing for in-depth analysis of the dataset and the methods built on top of it.
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
However, most of the existing multimodal models are pre-trained for understanding tasks, leading to a pretrain-finetune discrepancy for generation tasks.
Targeted Attack for Deep Hashing based Retrieval
In this paper, we propose a novel method, dubbed deep hashing targeted attack (DHTA), to study the targeted attack on such retrieval.
Hysia: Serving DNN-Based Video-to-Retail Applications in Cloud
Combining video streaming and online retailing (V2R) has been a growing trend recently.
Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework
With the proposed Inter-Intra Contrastive (IIC) framework, we can train spatio-temporal convolutional networks to learn video representations.
Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics
Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, the spatial location and dominant color of the largest color diversity along the temporal axis, etc.
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.
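A multimodal contrastive loss of the kind mentioned here pulls embeddings of paired modalities together and pushes mismatched pairs apart. A minimal NumPy sketch of a symmetric InfoNCE-style loss (an illustrative formulation; VATT's exact losses and temperature are not reproduced):

```python
import numpy as np

def info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings,
    where row i of each matrix is the positive pair for row i of the other."""
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature

    def xent(lg):
        # cross-entropy with the diagonal (matching pair) as the target
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average the video-to-text and text-to-video directions
    return 0.5 * (xent(logits) + xent(logits.T))

# Perfectly aligned pairs should give a much lower loss than shuffled ones.
emb = np.eye(3)
loss_matched = info_nce(emb, emb)
loss_shuffled = info_nce(emb, emb[[1, 2, 0]])
print(loss_matched, loss_shuffled)
```

In an actual training loop the embeddings come from the modality encoders and the loss is minimized by gradient descent; the toy identity embeddings above only demonstrate the loss's ordering behavior.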