Video Retrieval
221 papers with code • 18 benchmarks • 31 datasets
The objective of video retrieval is as follows: given a text query and a pool of candidate videos, select the video that corresponds to the text query. Typically, the videos are returned as a ranked list of candidates and scored via document retrieval metrics.
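A minimal sketch of this protocol, assuming embeddings come from any text/video dual encoder (the `query_emb` and `video_embs` inputs and the `recall_at_k` helper are illustrative, not from any specific paper):

```python
import numpy as np

def rank_videos(query_emb, video_embs):
    """Rank candidate videos by cosine similarity to the text query.

    query_emb: (d,) text embedding; video_embs: (N, d) video embeddings.
    The encoders producing these embeddings are assumed, not shown.
    """
    q = query_emb / np.linalg.norm(query_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    scores = v @ q                 # cosine similarity per candidate
    return np.argsort(-scores)     # candidate indices, best match first

def recall_at_k(ranked, gold_index, k=5):
    """Standard retrieval metric: is the correct video in the top-k results?"""
    return float(gold_index in ranked[:k])
```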
Latest papers
Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval
Accordingly, a single text embedding may not be expressive enough to capture the video embedding and support retrieval.
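A minimal sketch of the underlying idea, treating the text as a stochastic mass of embeddings rather than a single point; the Gaussian perturbation, `radius`, and max-pooling over samples are illustrative assumptions, not the paper's exact formulation:

```python
import torch

def stochastic_text_scores(text_emb, video_embs, radius=0.1, n_samples=8):
    """Score videos against a cloud of text embeddings.

    text_emb:   (d,) deterministic text embedding
    video_embs: (N, d) candidate video embeddings
    """
    # Draw samples around the text embedding to model it as a stochastic mass
    # (assumed isotropic Gaussian noise for illustration).
    samples = text_emb + radius * torch.randn(n_samples, text_emb.shape[0])
    samples = torch.nn.functional.normalize(samples, dim=-1)
    video_embs = torch.nn.functional.normalize(video_embs, dim=-1)
    # Cosine similarity of every sample to every video; keep the best sample per video.
    sims = samples @ video_embs.T          # (n_samples, N)
    return sims.max(dim=0).values          # (N,)
```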
Composed Video Retrieval via Enriched Context and Discriminative Embeddings
Composed video retrieval (CoVR) is a challenging problem in computer vision that combines modification text with a visual query to enable more sophisticated video search over large databases.
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.
vid-TLDR: Training Free Token merging for Light-weight Video Transformer
To tackle these issues, we propose training-free token merging for lightweight video Transformers (vid-TLDR), which enhances the efficiency of video Transformers by merging background tokens without additional training.
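A rough sketch of the idea under stated assumptions: per-token saliency comes from something like the attention mass a token receives, and the background is pooled into a single token. vid-TLDR's actual merging rule may differ:

```python
import torch

def merge_background_tokens(tokens, saliency, keep_ratio=0.5):
    """Training-free token reduction: keep salient tokens, merge the rest.

    tokens:   (B, N, d) patch tokens
    saliency: (B, N) per-token importance, e.g. attention received
    """
    B, N, d = tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    # Keep the most salient (foreground) tokens untouched.
    idx = saliency.argsort(dim=-1, descending=True)
    fg = torch.gather(tokens, 1, idx[:, :n_keep, None].expand(-1, -1, d))
    bg = torch.gather(tokens, 1, idx[:, n_keep:, None].expand(-1, -1, d))
    # Collapse all background tokens into one averaged token, shrinking N
    # without any learned parameters (hence "training-free").
    merged_bg = bg.mean(dim=1, keepdim=True)
    return torch.cat([fg, merged_bg], dim=1)  # (B, n_keep + 1, d)
```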
Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
Multi-granularity Correspondence Learning from Long-term Noisy Videos
Existing video-language studies mainly focus on learning from short video clips, leaving long-term temporal dependencies rarely explored due to the prohibitively high computational cost of modeling long videos.
DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval
Equipping the visual and text encoders with separate prompts fails to mitigate the visual-text modality gap.
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos
A human needs to capture the event in every shot and associate the shots together to understand the story behind the video.
Let All be Whitened: Multi-teacher Distillation for Efficient Visual Retrieval
Rather than crafting a new method that pursues further gains in accuracy, in this paper we propose Whiten-MTD, a multi-teacher distillation framework that transfers knowledge from off-the-shelf pre-trained retrieval models to a lightweight student model for efficient visual retrieval.
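A hedged sketch of the kind of whitening step such a framework might apply to each teacher's embeddings so that different teachers' outputs live on comparable scales before distillation (ZCA whitening here; the paper's exact whitening and teacher-fusion details are not reproduced):

```python
import torch

def zca_whiten(embs, eps=1e-5):
    """ZCA-whiten a batch of teacher embeddings: zero mean, identity covariance.

    embs: (n, d) embeddings from one teacher model.
    """
    mu = embs.mean(dim=0, keepdim=True)
    x = embs - mu
    cov = x.T @ x / (x.shape[0] - 1)
    # Eigendecomposition of the (symmetric) covariance matrix.
    eigvals, eigvecs = torch.linalg.eigh(cov)
    w = eigvecs @ torch.diag((eigvals + eps).rsqrt()) @ eigvecs.T
    return x @ w
```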
RTQ: Rethinking Video-language Understanding Based on Image-text Model
Recent advancements in video-language understanding have been established on the foundation of image-text models, resulting in promising outcomes due to the shared knowledge between images and videos.