Video Retrieval

221 papers with code • 18 benchmarks • 31 datasets

The objective of video retrieval is as follows: given a text query and a pool of candidate videos, select the video that corresponds to the text query. Typically, the candidates are returned as a ranked list and evaluated with document retrieval metrics.
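
To make the ranking-and-scoring setup concrete, here is a minimal sketch in plain Python. It assumes each text query and each video has already been encoded into a fixed-length vector by some model (not specified on this page), that similarity is cosine similarity, and that each query has exactly one correct video. The function names (`cosine`, `rank_of_target`, `recall_at_k`) and the toy vectors are hypothetical and chosen only for illustration. Recall@K and median rank are used as examples of ranking metrics; they are not a statement of how any listed benchmark is scored.

```python
import math
from statistics import median


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_of_target(query_vec, video_vecs, target_idx):
    """Return the 1-based position of the correct video in the ranked list."""
    scores = [cosine(query_vec, v) for v in video_vecs]
    # Sort candidate indices from most to least similar to the query.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(target_idx) + 1


def recall_at_k(ranks, k):
    """Fraction of queries whose correct video appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


# Toy candidate pool (hypothetical embeddings, one per video).
video_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Each query is (hypothetical text embedding, index of its correct video).
queries = [([0.9, 0.1], 0), ([0.1, 0.9], 1), ([1.0, 0.2], 2)]

ranks = [rank_of_target(q, video_vecs, target) for q, target in queries]

print("ranks:", ranks)
print("Recall@1:", recall_at_k(ranks, 1))
print("Recall@2:", recall_at_k(ranks, 2))
print("median rank:", median(ranks))
```

In this toy pool the third query ranks its correct video second, so Recall@1 is below 1 while Recall@2 is 1. Real systems would replace the hand-written vectors with encoder outputs and usually compute the full query-by-video similarity matrix in one batched operation.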

Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning

whwu95/ATM 27 Nov 2023

In this paper, we present Side4Video, a novel Spatial-Temporal Side Network for memory-efficient fine-tuning of large image models for video understanding.

65

VideoCon: Robust Video-Language Alignment via Contrast Captions

hritikbansal/videocon 15 Nov 2023

Despite being (pre)trained on a massive amount of data, state-of-the-art video-language alignment models are not robust to semantically plausible contrastive changes in the video captions.

47

TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding

renshuhuai-andy/testa 29 Oct 2023

TESTA can reduce the number of visual tokens by 75% and thus accelerate video encoding.

39

Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data

wengzejia1/open-vclip 8 Oct 2023

Despite the significant results achieved by Contrastive Language-Image Pretraining (CLIP) in zero-shot image recognition, limited effort has been made to explore its potential for zero-shot video recognition.

91

GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval

huangmozhi9527/GMMFormer 8 Oct 2023

Current partially relevant video retrieval (PRVR) methods adopt scanning-based clip construction to achieve explicit clip modeling, which is information-redundant and requires a large storage overhead.

10

Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks

intellabs/multimodal_cognitive_ai 7 Oct 2023

We investigate 9 foundational image-text models on a diverse set of video tasks that include video action recognition (video AR), video retrieval (video RT), video question answering (video QA), video multiple choice (video MC) and video captioning (video CP).

35

HowToCaption: Prompting LLMs to Transform Video Annotations at Scale

ninatu/howtocaption 7 Oct 2023

Specifically, we prompt an LLM to create plausible video descriptions based on ASR narrations of the video for a large-scale instructional video dataset.

21

Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval

leolee99/pau NeurIPS 2023

In this paper, we propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework to provide trustworthy predictions by quantifying the uncertainty arising from inherent data ambiguity.

17
29 Sep 2023

Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning

alipay/Ant-Multi-Modal-Framework 20 Sep 2023

We therefore present a new Triplet Partial Margin Contrastive Learning (TPM-CL) module that constructs partial-order triplet samples by automatically generating fine-grained hard negatives for matched text-video pairs.

37

Unified Coarse-to-Fine Alignment for Video-Text Retrieval

ziyang412/ucofia ICCV 2023

Specifically, our model captures the cross-modal similarity information at different granularity levels.

40
18 Sep 2023