Video Retrieval
221 papers with code • 18 benchmarks • 31 datasets
The objective of video retrieval is as follows: given a text query and a pool of candidate videos, select the video which corresponds to the text query. Typically, the videos are returned as a ranked list of candidates and scored via document retrieval metrics.
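The setup above can be sketched in a few lines: embed the query and the candidate videos into a shared space (the embedding models themselves are assumed here), rank candidates by cosine similarity, and score the ranked list with a standard retrieval metric such as Recall@K. This is an illustrative sketch, not any particular paper's method.

```python
import numpy as np

def rank_videos(query_emb, video_embs):
    """Rank candidate videos by cosine similarity to a text query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    scores = v @ q
    return np.argsort(-scores)  # candidate indices, best match first

def recall_at_k(ranking, correct_idx, k):
    """1.0 if the ground-truth video appears in the top-k results, else 0.0."""
    return float(correct_idx in ranking[:k])

# Toy example with hand-picked 3-dim embeddings (real systems would use
# learned text/video encoders); video 0 is the intended match.
videos = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 1.0, 0.0]])
query = np.array([0.9, 0.1, 0.0])
ranking = rank_videos(query, videos)
print(ranking, recall_at_k(ranking, 0, 1))  # video 0 ranked first
```

Averaging Recall@K over a set of queries (typically K = 1, 5, 10) gives the headline numbers reported on the benchmarks listed here.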
Libraries
Use these libraries to find Video Retrieval models and implementations.
Most implemented papers
Hashing with Mutual Information
Binary vector embeddings enable fast nearest neighbor retrieval in large databases of high-dimensional objects, and play an important role in many practical applications, such as image and video retrieval.
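The speed advantage of binary codes comes from replacing float similarity with Hamming distance, which is a cheap bit-count. A minimal sketch, assuming sign-thresholding of real-valued embeddings (this illustrates hashing-based retrieval in general, not the paper's mutual-information objective):

```python
import numpy as np

def binarize(embeddings):
    """Sign-threshold real-valued embeddings into 0/1 binary codes."""
    return (np.asarray(embeddings) > 0).astype(np.uint8)

def hamming_nn(query_code, db_codes):
    """Return database indices sorted by Hamming distance to the query code."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists, kind="stable")

# Toy database of three 4-bit codes derived from made-up embeddings.
db = binarize([[ 0.3, -1.2,  0.5,  0.7],
               [-0.4,  0.9, -0.1,  0.2],
               [ 0.6, -0.8,  0.4,  0.9]])
q = binarize([0.2, -0.5, 0.3, 0.6])
order = hamming_nn(q, db)
print(order)  # items 0 and 2 share the query's code, item 1 is farthest
```

In production systems the XOR-and-popcount is done on packed 64-bit words, making each comparison a handful of CPU instructions.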
Person Search in Videos with One Portrait Through Visual and Temporal Links
In real-world applications, e.g. law enforcement and video retrieval, one often needs to search for a particular person in long videos given just one portrait.
A Joint Sequence Fusion Model for Video Question Answering and Retrieval
We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pair of multimodal sequences (e.g. a video clip and a language sentence).
TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval
The queries are also labeled with query types that indicate whether each is more related to the video, the subtitle, or both, allowing for in-depth analysis of the dataset and the methods built on top of it.
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
However, most of the existing multimodal models are pre-trained for understanding tasks, leading to a pretrain-finetune discrepancy for generation tasks.
Targeted Attack for Deep Hashing based Retrieval
In this paper, we propose a novel method, dubbed deep hashing targeted attack (DHTA), to study the targeted attack on such retrieval.
Hysia: Serving DNN-Based Video-to-Retail Applications in Cloud
Combining video streaming and online retailing (V2R) has been a growing trend recently.
Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework
With the proposed Inter-Intra Contrastive (IIC) framework, we can train spatio-temporal convolutional networks to learn video representations.
Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics
Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, the spatial location and dominant color of the largest color diversity along the temporal axis, etc.
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.
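A multimodal contrastive loss of the kind mentioned here pulls embeddings of paired modalities together and pushes mismatched pairs apart. A minimal NumPy sketch of a symmetric InfoNCE-style loss (an illustrative formulation; VATT's exact losses and temperature are not reproduced):

```python
import numpy as np

def info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings,
    where row i of each matrix is the positive pair for row i of the other."""
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature

    def xent(lg):
        # cross-entropy with the diagonal (matching pair) as the target
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average the video-to-text and text-to-video directions
    return 0.5 * (xent(logits) + xent(logits.T))

# Perfectly aligned pairs should give a much lower loss than shuffled ones.
emb = np.eye(3)
loss_matched = info_nce(emb, emb)
loss_shuffled = info_nce(emb, emb[[1, 2, 0]])
print(loss_matched, loss_shuffled)
```

In an actual training loop the embeddings come from the modality encoders and the loss is minimized by gradient descent; the toy identity embeddings above only demonstrate the loss's ordering behavior.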