Video Retrieval

221 papers with code • 18 benchmarks • 31 datasets

The objective of video retrieval is as follows: given a text query and a pool of candidate videos, select the video that corresponds to the text query. Typically, the candidates are returned as a ranked list and scored with document retrieval metrics such as Recall@K and median rank.
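The setup above can be sketched in a few lines: embed the query and the candidates, rank candidates by cosine similarity, and score with Recall@K. This is a minimal illustration with random vectors standing in for the output of a real text/video encoder (e.g. a CLIP-style model); the function names are hypothetical, not from any particular library.

```python
import numpy as np

def rank_videos(text_emb: np.ndarray, video_embs: np.ndarray) -> np.ndarray:
    """Return candidate indices sorted by cosine similarity to the query (best first)."""
    t = text_emb / np.linalg.norm(text_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    sims = v @ t                      # cosine similarity per candidate
    return np.argsort(-sims)          # descending similarity = ranked list

def recall_at_k(ranking: np.ndarray, gt_index: int, k: int) -> float:
    """Standard retrieval metric: 1.0 if the ground-truth video appears in the top k."""
    return float(gt_index in ranking[:k])

rng = np.random.default_rng(0)
videos = rng.normal(size=(100, 64))             # 100 candidate video embeddings (toy)
query = videos[42] + 0.1 * rng.normal(size=64)  # query embedding close to video 42
ranking = rank_videos(query, videos)
print(recall_at_k(ranking, gt_index=42, k=1))
```

With real encoders the embeddings would come from trained models, but the ranking and scoring logic is exactly this simple.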

Libraries

Use these libraries to find Video Retrieval models and implementations

Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval

jiamian-wang/t-mass-text-video-retrieval 26 Mar 2024

Correspondingly, a single text embedding may be insufficiently expressive to capture the video embedding and support retrieval.

Composed Video Retrieval via Enriched Context and Discriminative Embeddings

omkarthawakar/composed-video-retrieval 25 Mar 2024

Composed video retrieval (CoVR) is a challenging computer vision problem that integrates modification text with visual queries to enable more sophisticated video search in large databases.

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

opengvlab/internvideo 22 Mar 2024

We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.

vid-TLDR: Training Free Token merging for Light-weight Video Transformer

mlvlab/vid-tldr 20 Mar 2024

To tackle these issues, we propose training-free token merging for lightweight video Transformers (vid-TLDR), which aims to enhance the efficiency of video Transformers by merging background tokens without additional training.
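The core idea of background-token merging can be sketched as follows. This is a hedged illustration, not the vid-TLDR implementation: the paper derives saliency from attention maps, whereas this toy version uses token norms as a stand-in saliency score.

```python
import numpy as np

def merge_background_tokens(tokens: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Keep the most salient tokens; collapse the rest into one averaged token.

    tokens: (N, D) array of patch/frame token embeddings.
    Returns ceil(N * keep_ratio) kept tokens plus 1 merged background token.
    """
    n_keep = max(1, int(np.ceil(len(tokens) * keep_ratio)))
    saliency = np.linalg.norm(tokens, axis=1)      # toy proxy for attention-based saliency
    order = np.argsort(-saliency)                  # most salient first
    fg = tokens[np.sort(order[:n_keep])]           # salient tokens, in original order
    bg = tokens[order[n_keep:]]                    # low-saliency "background" tokens
    merged = bg.mean(axis=0, keepdims=True)        # merge background into a single token
    return np.concatenate([fg, merged], axis=0)

x = np.random.default_rng(1).normal(size=(16, 8))  # 16 tokens, dim 8
y = merge_background_tokens(x, keep_ratio=0.25)
print(y.shape)  # (5, 8): 4 kept tokens + 1 merged background token
```

Since the merge requires no learned parameters, it can be applied to a pretrained video Transformer without any fine-tuning, which is the sense in which the method is training-free.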

Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement

hdy007007/prem 21 Feb 2024

We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.

Multi-granularity Correspondence Learning from Long-term Noisy Videos

XLearning-SCU/2024-ICLR-Norton 30 Jan 2024

Existing video-language studies mainly focus on learning short video clips, leaving long-term temporal dependencies rarely explored due to the prohibitive computational cost of modeling long videos.

DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval

knightyxp/dgl 19 Jan 2024

Equipping the visual and text encoders with separate prompts fails to mitigate the visual-text modality gap.

Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos

bytedance/Shot2Story 16 Dec 2023

A human needs to capture the event in every shot and associate the shots together to understand the story behind the video.

Let All be Whitened: Multi-teacher Distillation for Efficient Visual Retrieval

maryeon/whiten_mtd 15 Dec 2023

Instead of crafting a new method to pursue further accuracy gains, in this paper we propose a multi-teacher distillation framework, Whiten-MTD, which transfers knowledge from off-the-shelf pre-trained retrieval models to a lightweight student model for efficient visual retrieval.
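The "whiten, then distill" idea can be sketched as below: whitening puts each teacher's embeddings on a comparable scale before fusing them into a single distillation target for the student. This is a hypothetical illustration of the general recipe (PCA whitening plus averaging), not the Whiten-MTD implementation.

```python
import numpy as np

def whiten(embs: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """PCA-whiten a (N, D) embedding matrix: zero mean, ~identity covariance."""
    x = embs - embs.mean(axis=0)
    cov = x.T @ x / len(x)
    vals, vecs = np.linalg.eigh(cov)           # eigendecomposition of covariance
    return x @ vecs / np.sqrt(vals + eps)      # rotate, then rescale each component

rng = np.random.default_rng(0)
teacher_a = rng.normal(size=(256, 16)) * 5.0   # two teachers on very different scales
teacher_b = rng.normal(size=(256, 16)) * 0.1

# After whitening both teachers contribute comparably; the student would be
# trained (e.g. with an MSE loss) to regress this fused target.
target = (whiten(teacher_a) + whiten(teacher_b)) / 2.0
print(target.shape)  # (256, 16)
```

Without whitening, the large-scale teacher would dominate the averaged target; normalizing each teacher's embedding distribution first is what makes the multi-teacher fusion balanced.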

RTQ: Rethinking Video-language Understanding Based on Image-text Model

SCZwangxiao/RTQ-MM2023 1 Dec 2023

Recent advancements in video-language understanding have been established on the foundation of image-text models, resulting in promising outcomes due to the shared knowledge between images and videos.
