Video-Text Retrieval

47 papers with code • 1 benchmark • 5 datasets

Video-text retrieval requires understanding video and language together, which distinguishes it from the plain video retrieval task.
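
At its core, text-to-video retrieval ranks candidate videos by their similarity to a text query in a shared embedding space. The minimal sketch below assumes the embeddings have already been produced by some video and text encoders that map into the same space; the function names and shapes are illustrative, not tied to any specific paper.

```python
# Minimal sketch of text-to-video retrieval over pre-computed embeddings.
import torch
import torch.nn.functional as F

def rank_videos(query_embedding: torch.Tensor, video_embeddings: torch.Tensor) -> torch.Tensor:
    """Return indices of candidate videos sorted by cosine similarity to the query.

    query_embedding: (D,) embedding of the text query.
    video_embeddings: (N, D) embeddings of the N candidate videos.
    """
    query = F.normalize(query_embedding, dim=-1)
    videos = F.normalize(video_embeddings, dim=-1)
    similarities = videos @ query            # (N,) cosine similarities
    return similarities.argsort(descending=True)
```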

Libraries

Use these libraries to find Video-Text Retrieval models and implementations

Most implemented papers

Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval

niluthpol/multimodal_vtt ICMR 2018

Constructing a joint representation invariant across different modalities (e.g., video, language) is of significant importance in many multimedia applications.
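
Joint video-text embeddings of this kind are commonly trained with a bidirectional triplet ranking loss that pulls matching video-caption pairs together and pushes mismatched pairs apart. The sketch below shows a hard-negative hinge variant as a generic illustration, not this paper's exact objective or its use of multimodal cues.

```python
# A minimal sketch of a bidirectional triplet ranking loss with hard negatives,
# a common objective for joint video-text embedding learning (illustrative only).
import torch

def triplet_ranking_loss(video_emb: torch.Tensor, text_emb: torch.Tensor, margin: float = 0.2):
    """video_emb, text_emb: (B, D) L2-normalized embeddings of matching pairs."""
    sims = video_emb @ text_emb.t()                  # (B, B) similarity matrix
    pos = sims.diag().view(-1, 1)                    # similarities of true pairs
    cost_t = (margin + sims - pos).clamp(min=0)      # video anchors vs. negative texts
    cost_v = (margin + sims - pos.t()).clamp(min=0)  # text anchors vs. negative videos
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    cost_t = cost_t.masked_fill(mask, 0)
    cost_v = cost_v.masked_fill(mask, 0)
    # keep only the hardest negative per anchor in each direction
    return cost_t.max(dim=1)[0].mean() + cost_v.max(dim=0)[0].mean()
```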

Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval

yalesong/pvse CVPR 2019

In this work, we introduce Polysemous Instance Embedding Networks (PIE-Nets) that compute multiple and diverse representations of an instance by combining global context with locally-guided features via multi-head self-attention and residual learning.
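
The excerpt describes the key mechanism: several learned "slots" attend over local features with multi-head self-attention, and each attended view is combined with the global context through a residual connection. The module below is a rough sketch of that idea; the dimensions, slot count, and layer names are assumptions, not the authors' exact PIE-Net architecture.

```python
# Rough sketch of polysemous instance embedding: K diverse embeddings obtained by
# attending over local features and residually combining with a global feature.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolysemousEmbedding(nn.Module):
    def __init__(self, dim: int = 512, num_embeddings: int = 4, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_embeddings, dim))  # K learned slots
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fc = nn.Linear(dim, dim)

    def forward(self, global_feat: torch.Tensor, local_feats: torch.Tensor) -> torch.Tensor:
        """global_feat: (B, D); local_feats: (B, L, D), e.g. frame or word features."""
        B = local_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)        # (B, K, D)
        attended, _ = self.attn(q, local_feats, local_feats)   # locally-guided features
        out = global_feat.unsqueeze(1) + self.fc(attended)     # residual with global context
        return F.normalize(out, dim=-1)                        # (B, K, D) diverse embeddings
```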

Retrieving and Highlighting Action with Spatiotemporal Reference

yiskw713/yiskw713 19 May 2020

In this paper, we present a framework that jointly retrieves and spatiotemporally highlights actions in videos by enhancing current deep cross-modal retrieval methods.

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

gingsi/coot-videotext NeurIPS 2020

Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics.

Learning the Best Pooling Strategy for Visual Semantic Embedding

woodfrog/vse_infty CVPR 2021

Visual Semantic Embedding (VSE) is a dominant approach for vision-language retrieval, which aims at learning a deep embedding space such that visual data are embedded close to their semantic text labels or descriptions.
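
A central design choice in VSE models is how a set of local features (frames or word tokens) is pooled into one embedding; this paper's contribution is to learn that pooling rather than fix it. The snippet below only illustrates the fixed-pooling design space under assumed tensor shapes, not the paper's learned Generalized Pooling Operator.

```python
# Contrast of fixed pooling choices for aggregating local features into one embedding.
import torch
import torch.nn.functional as F

def pool_features(feats: torch.Tensor, strategy: str = "mean") -> torch.Tensor:
    """feats: (B, L, D) per-frame or per-token features -> (B, D) pooled embedding."""
    if strategy == "mean":
        pooled = feats.mean(dim=1)
    elif strategy == "max":
        pooled = feats.max(dim=1).values
    elif strategy == "k_max":
        # average of the top-k activations per dimension, a middle ground
        pooled = feats.topk(k=4, dim=1).values.mean(dim=1)
    else:
        raise ValueError(f"unknown pooling strategy: {strategy}")
    return F.normalize(pooled, dim=-1)
```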

Rudder: A Cross Lingual Video and Text Retrieval Dataset

nshubham655/RUDDER 9 Mar 2021

Video retrieval using natural language queries requires learning semantically meaningful joint embeddings between the text and the audio-visual input.

CLIP2Video: Mastering Video-Text Retrieval via Image CLIP

CryhanFang/CLIP2Video 21 Jun 2021

We present the CLIP2Video network to transfer the image-language pre-training model to video-text retrieval in an end-to-end manner.
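
The transfer idea can be pictured in its simplest form: encode sampled frames with an image CLIP model, aggregate the frame embeddings, and match against the text embedding. The sketch below uses mean pooling over frames as an assumed, deliberately simplified baseline; CLIP2Video itself adds temporal difference and temporal alignment blocks on top of the CLIP backbone.

```python
# Simplified frame-level CLIP transfer for video-text matching (not the paper's full model).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def video_text_similarity(frames, caption: str) -> float:
    """frames: list of PIL images sampled from one clip; caption: text query."""
    inputs = processor(text=[caption], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        frame_emb = model.get_image_features(pixel_values=inputs["pixel_values"])    # (T, D)
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])  # (1, D)
    video_emb = frame_emb.mean(dim=0, keepdim=True)                  # temporal mean pooling
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (video_emb @ text_emb.t()).item()
```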

HANet: Hierarchical Alignment Networks for Video-Text Retrieval

Roc-Ng/HANet 26 Jul 2021

Based on these, we naturally construct hierarchical representations in an individual-local-global manner, where the individual level focuses on the alignment between frame and word, the local level on the alignment between video clip and textual context, and the global level on the alignment between the whole video and the whole text.
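
The three alignment levels can be read as three similarity scores that are fused into one retrieval score. The sketch below assumes the per-level embeddings are already computed and fuses the scores with simple weights; the aggregation and fusion choices here are illustrative assumptions, not HANet's exact design.

```python
# Coarse sketch of hierarchical video-text similarity: frame-word, clip-context,
# and video-sentence scores fused into one retrieval score.
import torch

def hierarchical_similarity(frame_emb, word_emb, clip_emb, ctx_emb, video_emb, sent_emb,
                            weights=(1.0, 1.0, 1.0)) -> torch.Tensor:
    """frame_emb: (Tf, D), word_emb: (Tw, D), clip_emb: (Tc, D), ctx_emb: (Tp, D),
    video_emb: (D,), sent_emb: (D,). All assumed L2-normalized."""
    # individual level: best-matching word for each frame, averaged over frames
    s_ind = (frame_emb @ word_emb.t()).max(dim=1).values.mean()
    # local level: best-matching textual context for each clip, averaged over clips
    s_loc = (clip_emb @ ctx_emb.t()).max(dim=1).values.mean()
    # global level: whole video vs. whole sentence
    s_glob = torch.dot(video_emb, sent_emb)
    w = torch.tensor(weights)
    return (w[0] * s_ind + w[1] * s_loc + w[2] * s_glob) / w.sum()
```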

Video-Text Pre-training with Learned Regions

showlab/region_learner 2 Dec 2021

Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs via aligning the semantics between visual and textual information.

X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval

layer6ai-labs/xpool CVPR 2022

Instead, texts often capture sub-regions of entire videos and are most semantically similar to certain frames within videos.
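
This observation motivates text-conditioned pooling: instead of aggregating frames uniformly, the text embedding acts as a query that attends over frame embeddings, so the frames most relevant to the caption dominate the video representation. The module below is a condensed sketch of that idea; layer sizes and the single-head formulation are assumptions, and the authors' repository contains the real implementation.

```python
# Condensed sketch of text-conditioned attention pooling over frame embeddings.
import torch
import torch.nn as nn

class TextConditionedPooling(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, text_emb: torch.Tensor, frame_embs: torch.Tensor) -> torch.Tensor:
        """text_emb: (B, D); frame_embs: (B, T, D) -> text-specific video embedding (B, D)."""
        q = self.q_proj(text_emb).unsqueeze(1)                               # (B, 1, D)
        k = self.k_proj(frame_embs)                                          # (B, T, D)
        v = self.v_proj(frame_embs)                                          # (B, T, D)
        attn = torch.softmax((q @ k.transpose(1, 2)) * self.scale, dim=-1)   # (B, 1, T)
        return (attn @ v).squeeze(1)                                         # weighted sum over frames
```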