Video-Text Retrieval

47 papers with code • 1 benchmark • 5 datasets

Video-text retrieval requires a joint understanding of both video and language; it is therefore distinct from the purely visual video retrieval task.
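
At inference time, the task is typically cast as cross-modal ranking: embed text queries and videos in a shared space and sort videos by similarity. The sketch below is a generic illustration of this setup, not any specific paper's method; all shapes and embeddings are made up.

```python
import torch
import torch.nn.functional as F

# Hypothetical pre-computed embeddings: 100 videos and 5 text queries, dim 512.
video_emb = F.normalize(torch.randn(100, 512), dim=-1)
text_emb = F.normalize(torch.randn(5, 512), dim=-1)

# Text-to-video retrieval: score every (query, video) pair by cosine similarity
# and rank videos per query.
sim = text_emb @ video_emb.t()                 # (5, 100) similarity matrix
ranks = sim.argsort(dim=-1, descending=True)   # ranked video indices per query
print(ranks[:, :10])                           # top-10 candidates for each query
```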


Most implemented papers

MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval

tencentarc/mcq 26 Apr 2022

Dominant pre-training works for video-text retrieval mainly adopt "dual-encoder" architectures to enable efficient retrieval, where two separate encoders contrast global video and text representations but ignore detailed local semantics.
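
A minimal sketch of the dual-encoder contrastive objective referred to above, assuming a batch of paired global video and text embeddings and a symmetric InfoNCE loss; the function name and temperature are illustrative, not MILES's actual code.

```python
import torch
import torch.nn.functional as F

def dual_encoder_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over paired global video/text embeddings (illustrative)."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature              # (B, B) pairwise similarities
    targets = torch.arange(v.size(0))             # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

loss = dual_encoder_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```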

Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs

fundamentalvision/Uni-Perceiver 9 Jun 2022

To mitigate such interference, we introduce the Conditional Mixture-of-Experts (Conditional MoEs) to generalist models.
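
The toy module below conveys the basic idea of conditional routing: each token is dispatched to an expert MLP according to a condition id (here a made-up modality label) rather than a learned token-level router. Uni-Perceiver-MoE's actual routing strategies are richer; this is only a sketch.

```python
import torch
import torch.nn as nn

class ConditionalMoE(nn.Module):
    """Toy conditional MoE: each token goes to one expert FFN chosen by a
    condition id (e.g. modality), not by a learned router."""
    def __init__(self, dim, num_experts):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, tokens, cond_ids):
        # tokens: (N, dim); cond_ids: (N,) integer expert assignment per token
        out = torch.empty_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = cond_ids == e
            if mask.any():
                out[mask] = expert(tokens[mask])
        return out

moe = ConditionalMoE(dim=256, num_experts=2)
x = torch.randn(10, 256)
modality = torch.tensor([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])  # e.g. 0 = vision, 1 = text
y = moe(x, modality)
```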

X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval

xuguohai/X-CLIP 15 Jul 2022

However, cross-grained contrast, which is the contrast between coarse-grained representations and fine-grained representations, has rarely been explored in prior research.
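
As a toy instance of one cross-grained contrast, the sketch below scores a coarse-grained sentence embedding against fine-grained frame embeddings and aggregates the per-frame similarities with a softmax weighting; this loosely resembles attention-over-similarity aggregation but is not X-CLIP's exact formulation, and all names are assumptions.

```python
import torch
import torch.nn.functional as F

def cross_grained_similarity(sentence_emb, frame_embs, temperature=0.07):
    """Contrast a coarse sentence embedding against fine-grained frame
    embeddings, pooling per-frame scores with a softmax weighting."""
    s = F.normalize(sentence_emb, dim=-1)          # (dim,)
    f = F.normalize(frame_embs, dim=-1)            # (num_frames, dim)
    scores = f @ s                                 # per-frame similarities
    weights = torch.softmax(scores / temperature, dim=0)
    return (weights * scores).sum()                # attention-weighted aggregate

sim = cross_grained_similarity(torch.randn(512), torch.randn(12, 512))
```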

Vision-Language Pre-training: Basics, Recent Advances, and Future Trends

computer-vision-in-the-wild/cvinw_readings 17 Oct 2022

This paper surveys vision-language pre-training (VLP) methods for multimodal intelligence that have been developed in the last few years.

VTC: Improving Video-Text Retrieval with User Comments

unitaryai/VTC 19 Oct 2022

In this paper, we a) introduce a new dataset of videos, titles and comments; b) present an attention-based mechanism that allows the model to learn from sometimes irrelevant data such as comments; c) show that by using comments, our method is able to learn better, more contextualised representations for image, video and audio.
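
A hedged sketch of such an attention mechanism: comment embeddings are pooled with weights derived from their similarity to the video representation, so off-topic comments receive little weight. Tensor shapes, names, and the temperature are hypothetical, not VTC's implementation.

```python
import torch
import torch.nn.functional as F

def aggregate_comments(video_emb, comment_embs, temperature=0.1):
    """Attention-pool comment embeddings against the video representation,
    down-weighting comments that are irrelevant to the video."""
    q = F.normalize(video_emb, dim=-1)             # (dim,)
    c = F.normalize(comment_embs, dim=-1)          # (num_comments, dim)
    weights = torch.softmax((c @ q) / temperature, dim=0)
    return weights @ c                             # weighted sum of comments

ctx = aggregate_comments(torch.randn(512), torch.randn(20, 512))
```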

Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning

iigroup/scl CVPR 2023

Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn the correct corresponding information across different modalities.

Masked Contrastive Pre-Training for Efficient Video-Text Retrieval

shufangxun/MAC 2 Dec 2022

Our MAC aims to reduce the spatial and temporal redundancy of video representations in the VidLP model through a mask sampling mechanism, improving pre-training efficiency.
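
One plausible reading of mask sampling is to randomly keep only a small fraction of spatio-temporal patch tokens so the encoder processes far fewer tokens during pre-training. The sketch below illustrates this; the keep ratio and shapes are assumptions, not MAC's actual settings.

```python
import torch

def mask_sample_patches(patch_tokens, keep_ratio=0.3):
    """Randomly keep a subset of spatio-temporal patch tokens and drop the
    rest, so the encoder sees far fewer tokens during pre-training."""
    n = patch_tokens.size(0)
    keep = max(1, int(n * keep_ratio))
    idx = torch.randperm(n)[:keep]
    return patch_tokens[idx], idx                  # visible tokens + their indices

tokens = torch.randn(8 * 14 * 14, 768)             # 8 frames of 14x14 patches
visible, idx = mask_sample_patches(tokens, keep_ratio=0.3)
```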

Test of Time: Instilling Video-Language Models with a Sense of Time

bpiyush/TestOfTime CVPR 2023

Our work serves as a first step towards probing and instilling a sense of time in existing video-language models without the need for data- and compute-intensive training from scratch.

MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval

zhangbw17/mv-adapter 19 Jan 2023

The first is a Temporal Adaptation Module that is incorporated in the video branch to introduce global and local temporal contexts.
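
As a rough sketch of how a temporal adapter can inject temporal context into per-frame features of a frozen backbone, the module below combines a bottleneck projection with a depthwise temporal convolution; it is an assumption-laden illustration, not MV-Adapter's actual Temporal Adaptation Module.

```python
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    """Toy bottleneck adapter with a depthwise temporal convolution, meant to
    mix local temporal context into per-frame features of a frozen backbone."""
    def __init__(self, dim, bottleneck=64, kernel=3):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.temporal = nn.Conv1d(bottleneck, bottleneck, kernel,
                                  padding=kernel // 2, groups=bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):                          # x: (batch, frames, dim)
        h = self.down(x)                           # bottleneck projection
        h = self.temporal(h.transpose(1, 2)).transpose(1, 2)  # mix across time
        return x + self.up(h)                      # residual connection

adapter = TemporalAdapter(dim=512)
out = adapter(torch.randn(2, 8, 512))
```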