Video Retrieval
221 papers with code • 18 benchmarks • 31 datasets
The objective of video retrieval is as follows: given a text query and a pool of candidate videos, select the video that corresponds to the text query. Typically, the videos are returned as a ranked list of candidates and scored via document retrieval metrics such as Recall@K and median rank.
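For example, with paired text-video data the ranked list is scored against the known match for each query. A minimal NumPy sketch, assuming the usual paired-data convention that video i is the ground truth for query i:

    import numpy as np

    def retrieval_metrics(sim, ks=(1, 5, 10)):
        """Score a text-to-video similarity matrix.

        sim: (num_queries, num_videos) array where sim[i, j] scores
        video j for query i; video i is assumed to be the ground-truth
        match for query i.
        """
        order = np.argsort(-sim, axis=1)                  # best video first
        # Position of the ground-truth video in each query's ranking (0 = top).
        gt_rank = np.argmax(order == np.arange(len(sim))[:, None], axis=1)
        metrics = {f"R@{k}": float((gt_rank < k).mean()) for k in ks}
        metrics["MedR"] = float(np.median(gt_rank) + 1)   # 1-indexed median rank
        return metrics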
Libraries
Use these libraries to find Video Retrieval models and implementations.
Latest papers
Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning
In this paper, we present Side4Video, a novel Spatial-Temporal Side Network for memory-efficient fine-tuning of large image models for video understanding.
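The snippet names the architecture but not its details. As a hedged illustration of the general side-network pattern it builds on (a frozen image backbone plus a small trainable parallel branch, so backbone activations need not be kept for backpropagation), a PyTorch sketch follows; every layer choice and the backbone's feature interface are assumptions, not Side4Video's actual design:

    import torch
    import torch.nn as nn

    class SideNetwork(nn.Module):
        """Generic side-tuning sketch. Assumes `backbone` returns a list of
        per-layer pooled features of sizes `feat_dims`; only the side branch
        is trained."""

        def __init__(self, backbone, feat_dims, side_dim=256, num_classes=400):
            super().__init__()
            self.backbone = backbone
            for p in self.backbone.parameters():
                p.requires_grad = False       # backbone stays frozen
            self.adapters = nn.ModuleList(nn.Linear(d, side_dim) for d in feat_dims)
            self.temporal = nn.GRU(side_dim, side_dim, batch_first=True)
            self.head = nn.Linear(side_dim, num_classes)

        def forward(self, frames):            # frames: (B, T, C, H, W)
            B, T = frames.shape[:2]
            with torch.no_grad():             # no backbone activations stored
                feats = self.backbone(frames.flatten(0, 1))
            side = sum(a(f) for a, f in zip(self.adapters, feats))
            out, _ = self.temporal(side.view(B, T, -1))   # fuse across frames
            return self.head(out.mean(dim=1))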
VideoCon: Robust Video-Language Alignment via Contrast Captions
Despite being (pre)trained on a massive amount of data, state-of-the-art video-language alignment models are not robust to semantically plausible contrastive changes in the video captions.
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding
TESTA can reduce the number of visual tokens by 75% and thus accelerate video encoding.
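TESTA's aggregation is learned; purely to see where a 75% reduction can come from, note that halving the token count along both the temporal and the spatial axis yields a 4x (i.e. 75%) reduction. A naive average-pooling stand-in, not the paper's method:

    import torch

    def aggregate_tokens(tokens):
        """Illustrative temporal-spatial aggregation. Averaging pairs of
        adjacent frames and pairs of adjacent patches cuts the token
        count by 4x. tokens: (B, T, N, D) with T and N even."""
        B, T, N, D = tokens.shape
        t = tokens.view(B, T // 2, 2, N, D).mean(dim=2)   # merge frame pairs
        s = t.view(B, T // 2, N // 2, 2, D).mean(dim=3)   # merge patch pairs
        return s                                          # (B, T/2, N/2, D)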
Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data
Despite significant results achieved by Contrastive Language-Image Pretraining (CLIP) in zero-shot image recognition, limited effort has been made to explore its potential for zero-shot video recognition.
GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval
Current PRVR methods adopt scanning-based clip construction to achieve explicit clip modeling, which is information-redundant and incurs a large storage overhead.
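To make the redundancy concrete, scanning-based clip construction enumerates overlapping sliding windows at several temporal scales. A sketch with illustrative scale choices:

    def scanning_clips(num_frames, scales=(2, 4, 8)):
        """Enumerate sliding-window clips at several scales. The heavy
        overlap between windows is what makes this information-redundant
        and storage-heavy."""
        clips = []
        for w in scales:
            for start in range(0, num_frames - w + 1):
                clips.append((start, start + w))   # [start, end) frame span
        return clips

For a 32-frame video these three scales already produce 85 heavily overlapping clips, each needing its own stored embedding; this is the overhead GMMFormer aims to avoid.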
Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks
We investigate 9 foundational image-text models on a diverse set of video tasks that include video action recognition (video AR), video retrieval (video RT), video question answering (video QA), video multiple choice (video MC) and video captioning (video CP).
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
Specifically, we prompt an LLM to create plausible video descriptions based on ASR narrations of the video for a large-scale instructional video dataset.
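The paper's actual prompt is not quoted here; a hypothetical prompt builder in the same spirit, turning timestamped ASR lines into a request for per-segment descriptions, might look like:

    def build_caption_prompt(asr_segments):
        """Hypothetical prompt in the HowToCaption spirit: ask an LLM to
        rewrite noisy ASR narration into plausible video descriptions.
        The wording is illustrative, not the paper's actual prompt."""
        lines = "\n".join(f"[{s['start']:.0f}s-{s['end']:.0f}s] {s['text']}"
                          for s in asr_segments)
        return (
            "The following are automatic speech-recognition transcripts from "
            "an instructional video. For each time span, write one short "
            "sentence describing what is likely visible on screen.\n\n" + lines
        )

    prompt = build_caption_prompt([
        {"start": 0, "end": 5, "text": "so first we're gonna dice the onion"},
        {"start": 5, "end": 12, "text": "and get that into the pan with some oil"},
    ])
    # `prompt` is then sent to any chat/completions LLM endpoint.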
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval
In this paper, we propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework that provides trustworthy predictions by quantifying the uncertainty arising from inherent data ambiguity.
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning
We present a new Triplet Partial Margin Contrastive Learning (TPM-CL) module that constructs partial-order triplet samples by automatically generating fine-grained hard negatives for matched text-video pairs.
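TPM-CL's partial-order margins and automatic negative mining are not reproduced here; the standard triplet margin loss it builds on looks like this:

    import torch
    import torch.nn.functional as F

    def triplet_margin_loss(text, pos_video, neg_video, margin=0.2):
        """Standard triplet margin loss over cosine similarities.
        text, pos_video, neg_video: (B, D) embeddings."""
        pos = F.cosine_similarity(text, pos_video)   # similarity to the match
        neg = F.cosine_similarity(text, neg_video)   # similarity to a hard negative
        return F.relu(margin + neg - pos).mean()     # push pos above neg by `margin`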
Unified Coarse-to-Fine Alignment for Video-Text Retrieval
Specifically, our model captures cross-modal similarity information at different granularity levels.
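As a hedged sketch of what multi-granularity similarity can mean (an illustration, not the paper's exact formulation): a coarse video-sentence score blended with a fine frame-word score.

    import torch

    def coarse_to_fine_similarity(word_emb, sent_emb, frame_emb, video_emb,
                                  alpha=0.5):
        """Two-level cross-modal score for one text-video pair.
        word_emb: (Lw, D), sent_emb: (D,), frame_emb: (Lf, D),
        video_emb: (D,); all assumed L2-normalized."""
        coarse = sent_emb @ video_emb                             # global match
        fine = (word_emb @ frame_emb.T).max(dim=1).values.mean()  # best frame per word
        return alpha * coarse + (1 - alpha) * fine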