Video Retrieval

221 papers with code • 18 benchmarks • 31 datasets

The objective of video retrieval is: given a text query and a pool of candidate videos, retrieve the video that corresponds to the query. Typically, a model returns a ranked list of candidates, which is scored with standard document retrieval metrics such as Recall@K and median rank.
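As a concrete illustration of this evaluation protocol, the sketch below computes Recall@K and median rank from a query-by-video similarity matrix, under the common convention that query *i*'s ground-truth video sits at index *i* (the function name and example scores are illustrative, not from any specific benchmark):

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute Recall@K and median rank from a query-by-video
    similarity matrix, assuming query i's ground-truth video is
    at index i (the usual text-to-video evaluation convention)."""
    # Sort candidates for each query, best first (0 = top rank).
    order = np.argsort(-sim, axis=1)
    # Position of the ground-truth video in each query's ranking.
    ranks = np.argwhere(order == np.arange(len(sim))[:, None])[:, 1]
    metrics = {f"R@{k}": float(np.mean(ranks < k)) for k in ks}
    metrics["MedR"] = float(np.median(ranks) + 1)  # 1-indexed median rank
    return metrics

# Toy 3-query, 3-video similarity matrix.
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.2, 0.8],
                [0.1, 0.7, 0.6]])
print(retrieval_metrics(sim, ks=(1, 2)))
```

Higher Recall@K and lower median rank indicate better retrieval; most papers in this list report R@1/R@5/R@10 alongside MedR.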


Latest papers with no code

Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

no code yet • 1 Jan 2024

To address this issue, we adopt multi-granularity visual feature learning, ensuring the model's comprehensiveness in capturing visual content features spanning from abstract to detailed levels during the training phase.

No More Shortcuts: Realizing the Potential of Temporal Self-Supervision

no code yet • 20 Dec 2023

To address these issues, we propose 1) a more challenging reformulation of temporal self-supervision as frame-level (rather than clip-level) recognition tasks and 2) an effective augmentation strategy to mitigate shortcuts.

WAVER: Writing-style Agnostic Text-Video Retrieval via Distilling Vision-Language Models Through Open-Vocabulary Knowledge

no code yet • 15 Dec 2023

Text-video retrieval, a prominent sub-field within the domain of multimodal information retrieval, has witnessed remarkable growth in recent years.

Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning

no code yet • 10 Dec 2023

Orthogonal to the previous approaches to this limitation, we postulate that understanding the significance of the sentence components according to the target task can potentially enhance the performance of the models.

Vision-Language Models Learn Super Images for Efficient Partially Relevant Video Retrieval

no code yet • 1 Dec 2023

The zero-shot QASIR yields two discoveries: (1) it enables VLMs to generalize to super images and (2) the grid size $N$, image resolution, and VLM size are key trade-off parameters between performance and computation costs.

Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding

no code yet • 30 Nov 2023

To do this, video-language models must be able to obtain structured understandings, such as the temporal segmentation of a demonstration into sequences of actions and skills, and to generalize the understandings to novel domains.

A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval

no code yet • 30 Nov 2023

To provide a more thorough evaluation of the capabilities of long video retrieval systems, we propose a pipeline that leverages state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long videos.

E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer

no code yet • 28 Nov 2023

Regardless of their effectiveness, larger architectures unavoidably prevent the models from being extended to real-world applications, so building a lightweight VL architecture and an efficient learning schema is of great practical value.

Sinkhorn Transformations for Single-Query Postprocessing in Text-Video Retrieval

no code yet • 14 Nov 2023

A recent trend in multimodal retrieval is related to postprocessing test set results via the dual-softmax loss (DSL).

Lost Your Style? Navigating with Semantic-Level Approach for Text-to-Outfit Retrieval

no code yet • 3 Nov 2023

Fashion stylists have historically bridged the gap between consumers' desires and perfect outfits, which involve intricate combinations of colors, patterns, and materials.