Video Retrieval
221 papers with code • 18 benchmarks • 31 datasets
The objective of video retrieval is as follows: given a text query and a pool of candidate videos, select the video that corresponds to the text query. Typically, the videos are returned as a ranked list of candidates and scored via document retrieval metrics such as Recall@K and median rank.
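For example, with paired text-video data the ranked list is scored against the known match for each query. A minimal NumPy sketch, assuming the usual paired-data convention that video i is the ground truth for query i:

    import numpy as np

    def retrieval_metrics(sim, ks=(1, 5, 10)):
        """Score a text-to-video similarity matrix.

        sim: (num_queries, num_videos) array where sim[i, j] scores
        video j for query i; video i is assumed to be the ground-truth
        match for query i.
        """
        order = np.argsort(-sim, axis=1)                  # best video first
        # Position of the ground-truth video in each query's ranking (0 = top).
        gt_rank = np.argmax(order == np.arange(len(sim))[:, None], axis=1)
        metrics = {f"R@{k}": float((gt_rank < k).mean()) for k in ks}
        metrics["MedR"] = float(np.median(gt_rank) + 1)   # 1-indexed median rank
        return metrics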
Libraries
Use these libraries to find Video Retrieval models and implementations.
Latest papers
Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning
In this paper, we present Side4Video, a novel Spatial-Temporal Side Network for memory-efficient fine-tuning of large image models for video understanding.
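The snippet names the architecture but not its details. As a hedged illustration of the general side-network pattern it builds on (a frozen image backbone plus a small trainable parallel branch, so backbone activations need not be kept for backpropagation), a PyTorch sketch follows; every layer choice and the backbone's feature interface are assumptions, not Side4Video's actual design:

    import torch
    import torch.nn as nn

    class SideNetwork(nn.Module):
        """Generic side-tuning sketch. Assumes `backbone` returns a list of
        per-layer pooled features of sizes `feat_dims`; only the side branch
        is trained."""

        def __init__(self, backbone, feat_dims, side_dim=256, num_classes=400):
            super().__init__()
            self.backbone = backbone
            for p in self.backbone.parameters():
                p.requires_grad = False       # backbone stays frozen
            self.adapters = nn.ModuleList(nn.Linear(d, side_dim) for d in feat_dims)
            self.temporal = nn.GRU(side_dim, side_dim, batch_first=True)
            self.head = nn.Linear(side_dim, num_classes)

        def forward(self, frames):            # frames: (B, T, C, H, W)
            B, T = frames.shape[:2]
            with torch.no_grad():             # no backbone activations stored
                feats = self.backbone(frames.flatten(0, 1))
            side = sum(a(f) for a, f in zip(self.adapters, feats))
            out, _ = self.temporal(side.view(B, T, -1))   # fuse across frames
            return self.head(out.mean(dim=1))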
VideoCon: Robust Video-Language Alignment via Contrast Captions
Despite being (pre)trained on a massive amount of data, state-of-the-art video-language alignment models are not robust to semantically plausible contrastive changes in the video captions.
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding
TESTA can reduce the number of visual tokens by 75% and thus accelerate video encoding.
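TESTA's aggregation is learned; purely to see where a 75% reduction can come from, note that halving the token count along both the temporal and the spatial axis yields a 4x (i.e. 75%) reduction. A naive average-pooling stand-in, not the paper's method:

    import torch

    def aggregate_tokens(tokens):
        """Illustrative temporal-spatial aggregation. Averaging pairs of
        adjacent frames and pairs of adjacent patches cuts the token
        count by 4x. tokens: (B, T, N, D) with T and N even."""
        B, T, N, D = tokens.shape
        t = tokens.view(B, T // 2, 2, N, D).mean(dim=2)   # merge frame pairs
        s = t.view(B, T // 2, N // 2, 2, D).mean(dim=3)   # merge patch pairs
        return s                                          # (B, T/2, N/2, D)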
Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data
Despite significant results achieved by Contrastive Language-Image Pretraining (CLIP) in zero-shot image recognition, limited effort has been made to explore its potential for zero-shot video recognition.
GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval
Current PRVR methods adopt scanning-based clip construction to achieve explicit clip modeling, which is information-redundant and incurs a large storage overhead.
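To make the redundancy concrete, scanning-based clip construction enumerates overlapping sliding windows at several temporal scales. A sketch with illustrative scale choices:

    def scanning_clips(num_frames, scales=(2, 4, 8)):
        """Enumerate sliding-window clips at several scales. The heavy
        overlap between windows is what makes this information-redundant
        and storage-heavy."""
        clips = []
        for w in scales:
            for start in range(0, num_frames - w + 1):
                clips.append((start, start + w))   # [start, end) frame span
        return clips

For a 32-frame video these three scales already produce 85 heavily overlapping clips, each needing its own stored embedding; this is the overhead GMMFormer aims to avoid.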
Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks
We investigate 9 foundational image-text models on a diverse set of video tasks that include video action recognition (video AR), video retrieval (video RT), video question answering (video QA), video multiple choice (video MC) and video captioning (video CP).
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
Specifically, we prompt an LLM to create plausible video descriptions based on ASR narrations of the video for a large-scale instructional video dataset.
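The paper's actual prompt is not quoted here; a hypothetical prompt builder in the same spirit, turning timestamped ASR lines into a request for per-segment descriptions, might look like:

    def build_caption_prompt(asr_segments):
        """Hypothetical prompt in the HowToCaption spirit: ask an LLM to
        rewrite noisy ASR narration into plausible video descriptions.
        The wording is illustrative, not the paper's actual prompt."""
        lines = "\n".join(f"[{s['start']:.0f}s-{s['end']:.0f}s] {s['text']}"
                          for s in asr_segments)
        return (
            "The following are automatic speech-recognition transcripts from "
            "an instructional video. For each time span, write one short "
            "sentence describing what is likely visible on screen.\n\n" + lines
        )

    prompt = build_caption_prompt([
        {"start": 0, "end": 5, "text": "so first we're gonna dice the onion"},
        {"start": 5, "end": 12, "text": "and get that into the pan with some oil"},
    ])
    # `prompt` is then sent to any chat/completions LLM endpoint.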
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval
In this paper, we propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework that provides trustworthy predictions by quantifying the uncertainty arising from inherent data ambiguity.
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning
We present a new Triplet Partial Margin Contrastive Learning (TPM-CL) module that constructs partial-order triplet samples by automatically generating fine-grained hard negatives for matched text-video pairs.
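TPM-CL's partial-order margins and automatic negative mining are not reproduced here; the standard triplet margin loss it builds on looks like this:

    import torch
    import torch.nn.functional as F

    def triplet_margin_loss(text, pos_video, neg_video, margin=0.2):
        """Standard triplet margin loss over cosine similarities.
        text, pos_video, neg_video: (B, D) embeddings."""
        pos = F.cosine_similarity(text, pos_video)   # similarity to the match
        neg = F.cosine_similarity(text, neg_video)   # similarity to a hard negative
        return F.relu(margin + neg - pos).mean()     # push pos above neg by `margin`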
Unified Coarse-to-Fine Alignment for Video-Text Retrieval
Specifically, our model captures cross-modal similarity information at different granularity levels.
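As a hedged sketch of what multi-granularity similarity can mean (an illustration, not the paper's exact formulation): a coarse video-sentence score blended with a fine frame-word score.

    import torch

    def coarse_to_fine_similarity(word_emb, sent_emb, frame_emb, video_emb,
                                  alpha=0.5):
        """Two-level cross-modal score for one text-video pair.
        word_emb: (Lw, D), sent_emb: (D,), frame_emb: (Lf, D),
        video_emb: (D,); all assumed L2-normalized."""
        coarse = sent_emb @ video_emb                             # global match
        fine = (word_emb @ frame_emb.T).max(dim=1).values.mean()  # best frame per word
        return alpha * coarse + (1 - alpha) * fine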