Video-Text Retrieval
47 papers with code • 1 benchmark • 5 datasets
Video-text retrieval requires understanding of both video and language together; it is therefore distinct from the video retrieval task.
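At its core, the task reduces to ranking candidate videos by the similarity of their embeddings to a text query's embedding in a shared space. A minimal sketch of that ranking step, using toy vectors rather than outputs of a real encoder:

```python
import numpy as np

def retrieve(text_emb, video_embs):
    """Rank candidate videos for one text query by cosine similarity.

    text_emb: (d,) query embedding; video_embs: (n, d) candidate embeddings.
    Returns candidate indices ordered from best to worst match.
    """
    t = text_emb / np.linalg.norm(text_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    return np.argsort(-(v @ t))  # descending cosine similarity

# Toy embeddings: the query points closest in direction to video 2.
videos = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0],
                   [0.6, 0.0, 0.8, 0.0]])
query = np.array([0.5, 0.0, 0.9, 0.0])
print(retrieve(query, videos))  # [2 0 1]
```

Real systems replace the toy vectors with outputs of learned video and text encoders; the ranking logic stays the same.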
Libraries
Use these libraries to find Video-Text Retrieval models and implementations.

Most implemented papers
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval
Dominant pre-training work for video-text retrieval mainly adopts "dual-encoder" architectures to enable efficient retrieval: two separate encoders contrast global video and text representations, but detailed local semantics are ignored.
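The dual-encoder setup described here is typically trained with a symmetric contrastive (InfoNCE-style) loss over a batch of matched video-text pairs. A minimal NumPy sketch of that objective, assuming row i of each input matrix is a matched pair:

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # stabilise before exponentiating
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each matrix is a matched pair."""
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = v @ t.T / temperature            # (B, B) similarity matrix
    diag = np.arange(len(logits))
    v2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # video -> text
    t2v = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> video
    return (v2t + t2v) / 2

aligned = np.eye(4)  # perfectly separable matched pairs
print(info_nce(aligned, aligned) < 0.01)  # True
```

Because each encoder embeds its modality independently, candidate video embeddings can be pre-computed and indexed, which is what makes dual encoders efficient at retrieval time.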
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
To mitigate such interference, we introduce the Conditional Mixture-of-Experts (Conditional MoEs) to generalist models.
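The general idea behind conditional MoE routing is that a gate selects which expert sub-network processes each input, so different tasks or modalities can activate different parameters instead of interfering in shared ones. A hypothetical top-1 routing sketch with random, untrained weights, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

class ConditionalMoE:
    """Top-1 conditional routing: a gate picks one expert FFN per token.

    Illustrative only: gate and expert weights are random, not trained.
    """
    def __init__(self, dim, n_experts):
        self.gate = rng.normal(size=(dim, n_experts))
        self.experts = [rng.normal(size=(dim, dim)) for _ in range(n_experts)]

    def __call__(self, x):
        # x: (n_tokens, dim); each token routes to its highest-scoring expert
        scores = x @ self.gate                # (n_tokens, n_experts)
        choice = scores.argmax(axis=1)
        out = np.empty_like(x)
        for e, W in enumerate(self.experts):
            mask = choice == e
            out[mask] = x[mask] @ W           # only the chosen expert runs
        return out, choice

moe = ConditionalMoE(dim=8, n_experts=4)
tokens = rng.normal(size=(5, 8))
out, routing = moe(tokens)
print(routing)  # which expert handled each token
```

Only one expert's weights are applied per token, so capacity grows with the number of experts while per-token compute stays roughly constant.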
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
However, cross-grained contrast, which is the contrast between coarse-grained representations and fine-grained representations, has rarely been explored in prior research.
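Cross-grained contrast, as the snippet describes it, scores fine-grained units (e.g. frames) against a coarse-grained representation (e.g. a sentence vector) and aggregates the scores. A simplified stand-in for that aggregation, not X-CLIP's exact formulation:

```python
import numpy as np

def cross_grained_sim(frame_embs, sent_emb):
    """Score fine-grained frames against one coarse sentence vector.

    Per-frame similarities are softmax-weighted so informative frames
    dominate the aggregated score (a simplified attention-over-frames).
    """
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    s = sent_emb / np.linalg.norm(sent_emb)
    sims = f @ s                                 # one score per frame
    weights = np.exp(sims) / np.exp(sims).sum()  # attend to relevant frames
    return float(weights @ sims)                 # aggregated similarity

frames = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
sentence = np.array([1.0, 0.0])
print(cross_grained_sim(frames, sentence))  # ~0.72, pulled toward the best frame
```

The same pattern applies in the other direction (words against a global video vector); coarse-coarse and fine-fine contrasts complete the multi-grained picture.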
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
The paper asks two questions: what factors hinder adapting a pre-trained image-text model to video-language tasks, and how can their impact be mitigated?
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends
This paper surveys vision-language pre-training (VLP) methods for multimodal intelligence that have been developed in the last few years.
VTC: Improving Video-Text Retrieval with User Comments
In this paper, we a) introduce a new dataset of videos, titles and comments; b) present an attention-based mechanism that allows the model to learn from sometimes irrelevant data such as comments; and c) show that by using comments, our method learns better, more contextualised representations for image, video and audio.
Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning
Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn the correct corresponding information across different modalities.
Masked Contrastive Pre-Training for Efficient Video-Text Retrieval
Our MAC aims to reduce the spatial and temporal redundancy of video representations in the VidLP model via a mask sampling mechanism, improving pre-training efficiency.
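Mask sampling in this spirit drops a large fraction of video patch tokens before encoding, so the encoder processes only the kept subset. An illustrative sketch, where the keep ratio and shapes are placeholders rather than MAC's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_sample(patch_tokens, keep_ratio=0.25):
    """Randomly keep a fraction of patch tokens; the encoder sees only these.

    Compute drops roughly in proportion to keep_ratio, since attention
    and FFN cost scale with the number of tokens encoded.
    """
    n = patch_tokens.shape[0]
    n_keep = max(1, int(n * keep_ratio))
    keep = np.sort(rng.permutation(n)[:n_keep])  # sorted keeps positional order
    return patch_tokens[keep], keep

tokens = rng.normal(size=(196, 64))  # e.g. a 14x14 patch grid, feature dim 64
kept, keep_idx = mask_sample(tokens, keep_ratio=0.25)
print(kept.shape)  # (49, 64): ~4x fewer tokens to encode
```

The kept indices are retained so positional information can still be attached to the surviving tokens before encoding.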
Test of Time: Instilling Video-Language Models with a Sense of Time
Our work serves as a first step towards probing and instilling a sense of time in existing video-language models without the need for data and compute-intense training from scratch.
MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval
The first is a Temporal Adaptation Module that is incorporated in the video branch to introduce global and local temporal contexts.
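A temporal adaptation module of this kind is usually a small bottleneck adapter in the video branch that mixes information across frames and adds it back residually. A rough sketch with random weights and neighbour averaging as a stand-in for learned temporal context, not MV-Adapter's actual module:

```python
import numpy as np

rng = np.random.default_rng(0)

def temporal_adapter(frame_feats, bottleneck=16):
    """Bottleneck adapter mixing per-frame features across time (residual).

    frame_feats: (T, d). Down-project, mix each frame with its neighbours
    (a crude local temporal context), up-project, and add back residually.
    Weights are random here; a real adapter learns them.
    """
    T, d = frame_feats.shape
    W_down = rng.normal(size=(d, bottleneck)) / np.sqrt(d)
    W_up = rng.normal(size=(bottleneck, d)) / np.sqrt(bottleneck)
    h = frame_feats @ W_down                        # (T, bottleneck)
    mixed = (np.roll(h, 1, axis=0) + h + np.roll(h, -1, axis=0)) / 3
    return frame_feats + mixed @ W_up               # residual connection

feats = rng.normal(size=(8, 32))  # 8 frames, feature dim 32
out = temporal_adapter(feats)
print(out.shape)  # (8, 32)
```

Because the adapter is residual and low-rank, the frozen image-text backbone is left intact and only a small number of new parameters are trained.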