Video Retrieval
221 papers with code • 18 benchmarks • 31 datasets
The objective of video retrieval is as follows: given a text query and a pool of candidate videos, select the video that corresponds to the text query. Typically, the videos are returned as a ranked list of candidates and scored via document retrieval metrics.
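A minimal sketch of this protocol, assuming embeddings come from any text/video dual encoder (the `query_emb` and `video_embs` inputs and the `recall_at_k` helper are illustrative, not from any specific paper):

```python
import numpy as np

def rank_videos(query_emb, video_embs):
    """Rank candidate videos by cosine similarity to the text query.

    query_emb: (d,) text embedding; video_embs: (N, d) video embeddings.
    The encoders producing these embeddings are assumed, not shown.
    """
    q = query_emb / np.linalg.norm(query_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    scores = v @ q                 # cosine similarity per candidate
    return np.argsort(-scores)     # candidate indices, best match first

def recall_at_k(ranked, gold_index, k=5):
    """Standard retrieval metric: is the correct video in the top-k results?"""
    return float(gold_index in ranked[:k])
```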
Latest papers
Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval
Accordingly, a single text embedding may not be expressive enough to capture the video embedding and support retrieval.
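A minimal sketch of the underlying idea, treating the text as a stochastic mass of embeddings rather than a single point; the Gaussian perturbation, `radius`, and max-pooling over samples are illustrative assumptions, not the paper's exact formulation:

```python
import torch

def stochastic_text_scores(text_emb, video_embs, radius=0.1, n_samples=8):
    """Score videos against a cloud of text embeddings.

    text_emb:   (d,) deterministic text embedding
    video_embs: (N, d) candidate video embeddings
    """
    # Draw samples around the text embedding to model it as a stochastic mass
    # (assumed isotropic Gaussian noise for illustration).
    samples = text_emb + radius * torch.randn(n_samples, text_emb.shape[0])
    samples = torch.nn.functional.normalize(samples, dim=-1)
    video_embs = torch.nn.functional.normalize(video_embs, dim=-1)
    # Cosine similarity of every sample to every video; keep the best sample per video.
    sims = samples @ video_embs.T          # (n_samples, N)
    return sims.max(dim=0).values          # (N,)
```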
Composed Video Retrieval via Enriched Context and Discriminative Embeddings
Composed video retrieval (CoVR) is a challenging problem in computer vision that combines modification text with a visual query to enable more sophisticated video search over large databases.
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.
vid-TLDR: Training Free Token merging for Light-weight Video Transformer
To tackle these issues, we propose training-free token merging for lightweight video Transformers (vid-TLDR), which enhances the efficiency of video Transformers by merging background tokens without additional training.
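A rough sketch of the idea under stated assumptions: per-token saliency comes from something like the attention mass a token receives, and the background is pooled into a single token. vid-TLDR's actual merging rule may differ:

```python
import torch

def merge_background_tokens(tokens, saliency, keep_ratio=0.5):
    """Training-free token reduction: keep salient tokens, merge the rest.

    tokens:   (B, N, d) patch tokens
    saliency: (B, N) per-token importance, e.g. attention received
    """
    B, N, d = tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    # Keep the most salient (foreground) tokens untouched.
    idx = saliency.argsort(dim=-1, descending=True)
    fg = torch.gather(tokens, 1, idx[:, :n_keep, None].expand(-1, -1, d))
    bg = torch.gather(tokens, 1, idx[:, n_keep:, None].expand(-1, -1, d))
    # Collapse all background tokens into one averaged token, shrinking N
    # without any learned parameters (hence "training-free").
    merged_bg = bg.mean(dim=1, keepdim=True)
    return torch.cat([fg, merged_bg], dim=1)  # (B, n_keep + 1, d)
```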
Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
Multi-granularity Correspondence Learning from Long-term Noisy Videos
Existing video-language studies mainly focus on learning from short video clips, leaving long-term temporal dependencies rarely explored due to the prohibitively high computational cost of modeling long videos.
DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval
Equipping the visual and text encoders with separate prompts fails to mitigate the visual-text modality gap.
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos
A human needs to capture the event in every shot and associate the shots together to understand the story behind the video.
Let All be Whitened: Multi-teacher Distillation for Efficient Visual Retrieval
Rather than crafting a new method that pursues further gains in accuracy, in this paper we propose Whiten-MTD, a multi-teacher distillation framework that transfers knowledge from off-the-shelf pre-trained retrieval models to a lightweight student model for efficient visual retrieval.
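A hedged sketch of the kind of whitening step such a framework might apply to each teacher's embeddings so that different teachers' outputs live on comparable scales before distillation (ZCA whitening here; the paper's exact whitening and teacher-fusion details are not reproduced):

```python
import torch

def zca_whiten(embs, eps=1e-5):
    """ZCA-whiten a batch of teacher embeddings: zero mean, identity covariance.

    embs: (n, d) embeddings from one teacher model.
    """
    mu = embs.mean(dim=0, keepdim=True)
    x = embs - mu
    cov = x.T @ x / (x.shape[0] - 1)
    # Eigendecomposition of the (symmetric) covariance matrix.
    eigvals, eigvecs = torch.linalg.eigh(cov)
    w = eigvecs @ torch.diag((eigvals + eps).rsqrt()) @ eigvecs.T
    return x @ w
```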
RTQ: Rethinking Video-language Understanding Based on Image-text Model
Recent advancements in video-language understanding have been established on the foundation of image-text models, resulting in promising outcomes due to the shared knowledge between images and videos.