Video-Text Retrieval
47 papers with code • 1 benchmark • 5 datasets
Video-text retrieval requires a joint understanding of both video and language, typically by embedding the two modalities into a shared space. This makes it distinct from the plain video retrieval task, which matches videos against video queries rather than natural-language ones.
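The common backbone shared by most of the papers below is a joint embedding space in which a text query and candidate videos are compared by cosine similarity. A minimal sketch, assuming pre-computed embeddings (the encoder names and dimensions here are illustrative, not from any specific paper):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Scale each vector to unit length so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def retrieve(text_emb, video_embs, k=2):
    # Rank videos by cosine similarity to the text query; return top-k indices.
    sims = l2_normalize(video_embs) @ l2_normalize(text_emb)
    return np.argsort(-sims)[:k]

# Toy embeddings: 3 videos and 1 text query in a shared 4-d space.
videos = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0, 0.0]])
query = np.array([1.0, 0.0, 0.0, 0.0])
top = retrieve(query, videos)  # indices of the nearest videos, best first
```

The same ranking works symmetrically for video-to-text retrieval by swapping which side is the query.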
Libraries
Use these libraries to find Video-Text Retrieval models and implementations
Most implemented papers
Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval
Constructing a joint representation invariant across different modalities (e.g., video, language) is of significant importance in many multimedia applications.
Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval
In this work, we introduce Polysemous Instance Embedding Networks (PIE-Nets) that compute multiple and diverse representations of an instance by combining global context with locally-guided features via multi-head self-attention and residual learning.
Retrieving and Highlighting Action with Spatiotemporal Reference
In this paper, we present a framework that jointly retrieves and spatiotemporally highlights actions in videos by enhancing current deep cross-modal retrieval methods.
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics.
Learning the Best Pooling Strategy for Visual Semantic Embedding
Visual Semantic Embedding (VSE) is a dominant approach for vision-language retrieval, which aims at learning a deep embedding space such that visual data are embedded close to their semantic text labels or descriptions.
Rudder: A Cross Lingual Video and Text Retrieval Dataset
Video retrieval using natural language queries requires learning semantically meaningful joint embeddings between the text and the audio-visual input.
CLIP2Video: Mastering Video-Text Retrieval via Image CLIP
We present CLIP2Video network to transfer the image-language pre-training model to video-text retrieval in an end-to-end manner.
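A simple baseline for transferring an image-language model to video, which CLIP2Video builds on and improves, is to encode each frame with the image encoder and mean-pool the frame embeddings into one video vector. This sketch assumes unit-norm frame embeddings are already available; it is not CLIP2Video's full temporal modeling:

```python
import numpy as np

def video_embedding(frame_embs):
    # Mean-pool per-frame image embeddings into a single video vector,
    # then renormalize so it can be compared to text by dot product.
    pooled = frame_embs.mean(axis=0)
    return pooled / np.linalg.norm(pooled)

# Two unit-norm frame embeddings from a hypothetical image encoder.
frames = np.array([[0.6, 0.8, 0.0],
                   [0.8, 0.6, 0.0]])
text = np.array([1.0, 0.0, 0.0])  # unit-norm text embedding
sim = float(video_embedding(frames) @ text)
```

Mean pooling treats every frame as equally relevant; the temporal extensions in papers like CLIP2Video exist precisely because that assumption often fails.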
HANet: Hierarchical Alignment Networks for Video-Text Retrieval
Based on these, we naturally construct hierarchical representations in an individual-local-global manner, where the individual level focuses on alignment between frames and words, the local level on alignment between video clips and textual context, and the global level on alignment between the whole video and the whole text.
Video-Text Pre-training with Learned Regions
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs via aligning the semantics between visual and textual information.
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval
Instead, texts often capture sub-regions of entire videos and are most semantically similar to certain frames within videos.
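The observation above motivates pooling frames conditioned on the text: weight each frame by its similarity to the query so the pooled video vector emphasizes the frames the text actually describes. A minimal sketch of this idea with a softmax over frame-text similarities (a simplification of X-Pool's cross-modal attention; the temperature value is an assumption):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax.
    e = np.exp(x - x.max())
    return e / e.sum()

def text_conditioned_pool(text_emb, frame_embs, temperature=0.1):
    # Attention weights: how similar each frame is to the text query.
    weights = softmax(frame_embs @ text_emb / temperature)
    # Pooled video embedding dominated by the text-relevant frames.
    return weights @ frame_embs, weights

frames = np.array([[1.0, 0.0],   # frame matching the text
                   [0.0, 1.0]])  # unrelated frame
text = np.array([1.0, 0.0])
pooled, w = text_conditioned_pool(text, frames)
```

With a low temperature the relevant frame receives nearly all of the weight, so the same video can yield different pooled embeddings for different queries.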