Video-Text Retrieval
47 papers with code • 1 benchmark • 5 datasets
Video-text retrieval requires a joint understanding of both video and language, typically by embedding the two modalities into a shared space. This makes it distinct from the plain video retrieval task, which matches videos against video queries rather than natural-language ones.
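The common backbone shared by most of the papers below is a joint embedding space in which a text query and candidate videos are compared by cosine similarity. A minimal sketch, assuming pre-computed embeddings (the encoder names and dimensions here are illustrative, not from any specific paper):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Scale each vector to unit length so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def retrieve(text_emb, video_embs, k=2):
    # Rank videos by cosine similarity to the text query; return top-k indices.
    sims = l2_normalize(video_embs) @ l2_normalize(text_emb)
    return np.argsort(-sims)[:k]

# Toy embeddings: 3 videos and 1 text query in a shared 4-d space.
videos = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0, 0.0]])
query = np.array([1.0, 0.0, 0.0, 0.0])
top = retrieve(query, videos)  # indices of the nearest videos, best first
```

The same ranking works symmetrically for video-to-text retrieval by swapping which side is the query.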
Libraries
Use these libraries to find Video-Text Retrieval models and implementations
Most implemented papers
Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval
Constructing a joint representation invariant across different modalities (e.g., video, language) is of significant importance in many multimedia applications.
Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval
In this work, we introduce Polysemous Instance Embedding Networks (PIE-Nets) that compute multiple and diverse representations of an instance by combining global context with locally-guided features via multi-head self-attention and residual learning.
Retrieving and Highlighting Action with Spatiotemporal Reference
In this paper, we present a framework that jointly retrieves and spatiotemporally highlights actions in videos by enhancing current deep cross-modal retrieval methods.
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics.
Learning the Best Pooling Strategy for Visual Semantic Embedding
Visual Semantic Embedding (VSE) is a dominant approach for vision-language retrieval, which aims at learning a deep embedding space such that visual data are embedded close to their semantic text labels or descriptions.
Rudder: A Cross Lingual Video and Text Retrieval Dataset
Video retrieval using natural language queries requires learning semantically meaningful joint embeddings between the text and the audio-visual input.
CLIP2Video: Mastering Video-Text Retrieval via Image CLIP
We present CLIP2Video network to transfer the image-language pre-training model to video-text retrieval in an end-to-end manner.
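A simple baseline for transferring an image-language model to video, which CLIP2Video builds on and improves, is to encode each frame with the image encoder and mean-pool the frame embeddings into one video vector. This sketch assumes unit-norm frame embeddings are already available; it is not CLIP2Video's full temporal modeling:

```python
import numpy as np

def video_embedding(frame_embs):
    # Mean-pool per-frame image embeddings into a single video vector,
    # then renormalize so it can be compared to text by dot product.
    pooled = frame_embs.mean(axis=0)
    return pooled / np.linalg.norm(pooled)

# Two unit-norm frame embeddings from a hypothetical image encoder.
frames = np.array([[0.6, 0.8, 0.0],
                   [0.8, 0.6, 0.0]])
text = np.array([1.0, 0.0, 0.0])  # unit-norm text embedding
sim = float(video_embedding(frames) @ text)
```

Mean pooling treats every frame as equally relevant; the temporal extensions in papers like CLIP2Video exist precisely because that assumption often fails.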
HANet: Hierarchical Alignment Networks for Video-Text Retrieval
Based on these, we naturally construct hierarchical representations in an individual-local-global manner, where the individual level focuses on alignment between frames and words, the local level on alignment between video clips and textual context, and the global level on alignment between the whole video and the whole text.
Video-Text Pre-training with Learned Regions
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs via aligning the semantics between visual and textual information.
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval
Instead, texts often capture sub-regions of entire videos and are most semantically similar to certain frames within videos.
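The observation above motivates pooling frames conditioned on the text: weight each frame by its similarity to the query so the pooled video vector emphasizes the frames the text actually describes. A minimal sketch of this idea with a softmax over frame-text similarities (a simplification of X-Pool's cross-modal attention; the temperature value is an assumption):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax.
    e = np.exp(x - x.max())
    return e / e.sum()

def text_conditioned_pool(text_emb, frame_embs, temperature=0.1):
    # Attention weights: how similar each frame is to the text query.
    weights = softmax(frame_embs @ text_emb / temperature)
    # Pooled video embedding dominated by the text-relevant frames.
    return weights @ frame_embs, weights

frames = np.array([[1.0, 0.0],   # frame matching the text
                   [0.0, 1.0]])  # unrelated frame
text = np.array([1.0, 0.0])
pooled, w = text_conditioned_pool(text, frames)
```

With a low temperature the relevant frame receives nearly all of the weight, so the same video can yield different pooled embeddings for different queries.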