Video-Text Retrieval
47 papers with code • 1 benchmark • 5 datasets
Video-text retrieval requires understanding of both video and language together; it is therefore distinct from the video retrieval task.
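At its core, the task reduces to ranking candidate videos by the similarity of their embeddings to a text query's embedding in a shared space. A minimal sketch of that ranking step, using toy vectors rather than outputs of a real encoder:

```python
import numpy as np

def retrieve(text_emb, video_embs):
    """Rank candidate videos for one text query by cosine similarity.

    text_emb: (d,) query embedding; video_embs: (n, d) candidate embeddings.
    Returns candidate indices ordered from best to worst match.
    """
    t = text_emb / np.linalg.norm(text_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    return np.argsort(-(v @ t))  # descending cosine similarity

# Toy embeddings: the query points closest in direction to video 2.
videos = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0],
                   [0.6, 0.0, 0.8, 0.0]])
query = np.array([0.5, 0.0, 0.9, 0.0])
print(retrieve(query, videos))  # [2 0 1]
```

Real systems replace the toy vectors with outputs of learned video and text encoders; the ranking logic stays the same.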
Libraries
Use these libraries to find Video-Text Retrieval models and implementations.

Most implemented papers
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval
Dominant pre-training work for video-text retrieval mainly adopts "dual-encoder" architectures to enable efficient retrieval: two separate encoders contrast global video and text representations, but detailed local semantics are ignored.
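The dual-encoder setup described here is typically trained with a symmetric contrastive (InfoNCE-style) loss over a batch of matched video-text pairs. A minimal NumPy sketch of that objective, assuming row i of each input matrix is a matched pair:

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # stabilise before exponentiating
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each matrix is a matched pair."""
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = v @ t.T / temperature            # (B, B) similarity matrix
    diag = np.arange(len(logits))
    v2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # video -> text
    t2v = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> video
    return (v2t + t2v) / 2

aligned = np.eye(4)  # perfectly separable matched pairs
print(info_nce(aligned, aligned) < 0.01)  # True
```

Because each encoder embeds its modality independently, candidate video embeddings can be pre-computed and indexed, which is what makes dual encoders efficient at retrieval time.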
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
To mitigate such interference, we introduce the Conditional Mixture-of-Experts (Conditional MoEs) to generalist models.
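The general idea behind conditional MoE routing is that a gate selects which expert sub-network processes each input, so different tasks or modalities can activate different parameters instead of interfering in shared ones. A hypothetical top-1 routing sketch with random, untrained weights, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

class ConditionalMoE:
    """Top-1 conditional routing: a gate picks one expert FFN per token.

    Illustrative only: gate and expert weights are random, not trained.
    """
    def __init__(self, dim, n_experts):
        self.gate = rng.normal(size=(dim, n_experts))
        self.experts = [rng.normal(size=(dim, dim)) for _ in range(n_experts)]

    def __call__(self, x):
        # x: (n_tokens, dim); each token routes to its highest-scoring expert
        scores = x @ self.gate                # (n_tokens, n_experts)
        choice = scores.argmax(axis=1)
        out = np.empty_like(x)
        for e, W in enumerate(self.experts):
            mask = choice == e
            out[mask] = x[mask] @ W           # only the chosen expert runs
        return out, choice

moe = ConditionalMoE(dim=8, n_experts=4)
tokens = rng.normal(size=(5, 8))
out, routing = moe(tokens)
print(routing)  # which expert handled each token
```

Only one expert's weights are applied per token, so capacity grows with the number of experts while per-token compute stays roughly constant.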
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
However, cross-grained contrast, which is the contrast between coarse-grained representations and fine-grained representations, has rarely been explored in prior research.
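Cross-grained contrast, as the snippet describes it, scores fine-grained units (e.g. frames) against a coarse-grained representation (e.g. a sentence vector) and aggregates the scores. A simplified stand-in for that aggregation, not X-CLIP's exact formulation:

```python
import numpy as np

def cross_grained_sim(frame_embs, sent_emb):
    """Score fine-grained frames against one coarse sentence vector.

    Per-frame similarities are softmax-weighted so informative frames
    dominate the aggregated score (a simplified attention-over-frames).
    """
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    s = sent_emb / np.linalg.norm(sent_emb)
    sims = f @ s                                 # one score per frame
    weights = np.exp(sims) / np.exp(sims).sum()  # attend to relevant frames
    return float(weights @ sims)                 # aggregated similarity

frames = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
sentence = np.array([1.0, 0.0])
print(cross_grained_sim(frames, sentence))  # ~0.72, pulled toward the best frame
```

The same pattern applies in the other direction (words against a global video vector); coarse-coarse and fine-fine contrasts complete the multi-grained picture.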
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
The paper asks two questions: what factors hinder adapting a pre-trained image-text model to video-language tasks, and how can their impact be mitigated?
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends
This paper surveys vision-language pre-training (VLP) methods for multimodal intelligence that have been developed in the last few years.
VTC: Improving Video-Text Retrieval with User Comments
In this paper, we a) introduce a new dataset of videos, titles and comments; b) present an attention-based mechanism that allows the model to learn from sometimes irrelevant data such as comments; and c) show that by using comments, our method learns better, more contextualised representations for image, video and audio.
Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning
Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn the correct corresponding information across different modalities.
Masked Contrastive Pre-Training for Efficient Video-Text Retrieval
Our MAC aims to reduce the spatial and temporal redundancy of video representations in the VidLP model via a mask sampling mechanism, improving pre-training efficiency.
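Mask sampling in this spirit drops a large fraction of video patch tokens before encoding, so the encoder processes only the kept subset. An illustrative sketch, where the keep ratio and shapes are placeholders rather than MAC's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_sample(patch_tokens, keep_ratio=0.25):
    """Randomly keep a fraction of patch tokens; the encoder sees only these.

    Compute drops roughly in proportion to keep_ratio, since attention
    and FFN cost scale with the number of tokens encoded.
    """
    n = patch_tokens.shape[0]
    n_keep = max(1, int(n * keep_ratio))
    keep = np.sort(rng.permutation(n)[:n_keep])  # sorted keeps positional order
    return patch_tokens[keep], keep

tokens = rng.normal(size=(196, 64))  # e.g. a 14x14 patch grid, feature dim 64
kept, keep_idx = mask_sample(tokens, keep_ratio=0.25)
print(kept.shape)  # (49, 64): ~4x fewer tokens to encode
```

The kept indices are retained so positional information can still be attached to the surviving tokens before encoding.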
Test of Time: Instilling Video-Language Models with a Sense of Time
Our work serves as a first step towards probing and instilling a sense of time in existing video-language models without the need for data and compute-intense training from scratch.
MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval
The first is a Temporal Adaptation Module that is incorporated in the video branch to introduce global and local temporal contexts.
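A temporal adaptation module of this kind is usually a small bottleneck adapter in the video branch that mixes information across frames and adds it back residually. A rough sketch with random weights and neighbour averaging as a stand-in for learned temporal context, not MV-Adapter's actual module:

```python
import numpy as np

rng = np.random.default_rng(0)

def temporal_adapter(frame_feats, bottleneck=16):
    """Bottleneck adapter mixing per-frame features across time (residual).

    frame_feats: (T, d). Down-project, mix each frame with its neighbours
    (a crude local temporal context), up-project, and add back residually.
    Weights are random here; a real adapter learns them.
    """
    T, d = frame_feats.shape
    W_down = rng.normal(size=(d, bottleneck)) / np.sqrt(d)
    W_up = rng.normal(size=(bottleneck, d)) / np.sqrt(bottleneck)
    h = frame_feats @ W_down                        # (T, bottleneck)
    mixed = (np.roll(h, 1, axis=0) + h + np.roll(h, -1, axis=0)) / 3
    return frame_feats + mixed @ W_up               # residual connection

feats = rng.normal(size=(8, 32))  # 8 frames, feature dim 32
out = temporal_adapter(feats)
print(out.shape)  # (8, 32)
```

Because the adapter is residual and low-rank, the frozen image-text backbone is left intact and only a small number of new parameters are trained.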