Video Retrieval
221 papers with code • 18 benchmarks • 31 datasets
The objective of video retrieval is as follows: given a text query and a pool of candidate videos, select the video that corresponds to the text query. Typically, the candidates are returned as a ranked list and scored with document-retrieval metrics.
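As a concrete illustration of the scoring step, the sketch below computes two common document-retrieval metrics, Recall@K and median rank, from a query-by-candidate similarity matrix. The matrix values, the diagonal ground-truth convention, and the function name are illustrative assumptions, not taken from any particular benchmark implementation.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute Recall@K and median rank from a similarity matrix.

    sim[i, j] is the similarity between text query i and candidate
    video j; the ground-truth video for query i is assumed to be
    video i (a common convention on paired retrieval benchmarks).
    """
    n = sim.shape[0]
    order = np.argsort(-sim, axis=1)  # candidates sorted by descending similarity
    # Rank of the ground-truth video for each query (1 = best).
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(n)])
    metrics = {f"R@{k}": float(np.mean(ranks <= k)) for k in ks}
    metrics["MedR"] = float(np.median(ranks))
    return metrics

# Toy example: 4 queries, 4 candidate videos.
sim = np.array([
    [0.9, 0.1, 0.2, 0.0],  # query 0 ranks its video first
    [0.3, 0.8, 0.1, 0.2],  # query 1 ranks its video first
    [0.7, 0.2, 0.5, 0.1],  # query 2 ranks its video second
    [0.1, 0.0, 0.2, 0.9],  # query 3 ranks its video first
])
print(retrieval_metrics(sim))  # → {'R@1': 0.75, 'R@5': 1.0, 'R@10': 1.0, 'MedR': 1.0}
```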
Libraries
Use these libraries to find Video Retrieval models and implementations.
Latest papers with no code
SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval
In particular, text-video retrieval, which aims to find the best-matching videos in a vast video corpus given text descriptions, is an essential function; its primary challenge is bridging the modality gap.
ProTA: Probabilistic Token Aggregation for Text-Video Retrieval
Text-video retrieval aims to find the most relevant cross-modal samples for a given query.
Event-aware Video Corpus Moment Retrieval
Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task focused on identifying a specific moment within a vast corpus of untrimmed videos using a natural language query.
Video Editing for Video Retrieval
The teacher model is employed to edit the clips in the training set whereas the student model trains on the edited clips.
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing
To bridge the gap between modalities, CoAVT employs a query encoder, which contains a set of learnable query embeddings and extracts the audiovisual features most informative for the corresponding text.
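The core mechanism such a query encoder relies on, a set of learnable query embeddings cross-attending over audio-visual tokens to pool them into a fixed-size summary, can be sketched in a few lines. This is a minimal single-head illustration, not CoAVT's actual architecture; all dimensions and the random features are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: 8 learnable queries, 20 audio-visual tokens, width 64.
n_queries, n_tokens, d = 8, 20, 64
queries = rng.normal(size=(n_queries, d))    # learnable query embeddings
av_tokens = rng.normal(size=(n_tokens, d))   # audio-visual token features

# One cross-attention step: each query attends over all audio-visual
# tokens and pools them into a single summary vector.
attn = softmax(queries @ av_tokens.T / np.sqrt(d), axis=-1)
summary = attn @ av_tokens                   # shape (n_queries, d)
print(summary.shape)  # → (8, 64)
```

During training, the query embeddings (and projection layers, omitted here) would be optimized so that the pooled summaries align with the paired text features.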
Distilling Vision-Language Models on Millions of Videos
Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%.
Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks
Compared to conventional textual retrieval, the main obstacle for text-video retrieval is the semantic gap between the textual nature of queries and the visual richness of video content.
Detours for Navigating Instructional Videos
We introduce the video detours problem for navigating instructional videos.
Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning
To address this issue, we adopt multi-granularity visual feature learning, ensuring the model's comprehensiveness in capturing visual content features spanning from abstract to detailed levels during the training phase.
No More Shortcuts: Realizing the Potential of Temporal Self-Supervision
To address these issues, we propose 1) a more challenging reformulation of temporal self-supervision as frame-level (rather than clip-level) recognition tasks and 2) an effective augmentation strategy to mitigate shortcuts.