Video-Text Retrieval

47 papers with code • 1 benchmark • 5 datasets

Video-text retrieval requires joint understanding of both video and language, which distinguishes it from the video retrieval task.
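As a rough sketch of the task (assuming pre-computed embeddings from any dual-encoder model such as CLIP; the function and variable names below are placeholders, not a specific library's API), text-to-video retrieval ranks candidate videos by cross-modal similarity to the query:

```python
import numpy as np

def retrieve_videos(text_emb, video_embs, top_k=5):
    """Rank videos by cosine similarity to a query text embedding.

    text_emb:   (d,) query embedding from a text encoder (placeholder).
    video_embs: (n, d) embeddings of n candidate videos (placeholder).
    """
    # L2-normalize so the dot product equals cosine similarity.
    text_emb = text_emb / np.linalg.norm(text_emb)
    video_embs = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    sims = video_embs @ text_emb              # (n,) similarity scores
    return np.argsort(-sims)[:top_k]          # indices of the best-matching videos

# Usage with random placeholders standing in for real encoder outputs.
rng = np.random.default_rng(0)
query = rng.normal(size=512)
videos = rng.normal(size=(100, 512))
print(retrieve_videos(query, videos))
```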

Libraries

Use these libraries to find Video-Text Retrieval models and implementations

Most implemented papers

Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring

farewellthree/stan CVPR 2023

In this paper, based on the CLIP model, we revisit temporal modeling in the context of image-to-video knowledge transferring, which is the key point for extending image-text pretrained models to the video domain.
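For context, the simplest temporal model when transferring CLIP to video is parameter-free mean pooling of per-frame image features; works such as STAN study what to use instead. A minimal sketch, assuming per-frame CLIP features are already computed (the names below are illustrative):

```python
import torch

def video_embedding_from_frames(frame_features):
    """Mean-pool per-frame CLIP image features into one video embedding.

    frame_features: (num_frames, d) tensor, e.g. CLIP image-encoder outputs
    for sampled frames (hypothetical inputs for illustration). This
    parameter-free pooling is the simplest temporal model; learned temporal
    modules replace it in papers like the one above.
    """
    video_emb = frame_features.mean(dim=0)
    return video_emb / video_emb.norm()       # normalize for cosine retrieval

frames = torch.randn(8, 512)                  # 8 sampled frames, 512-d features
print(video_embedding_from_frames(frames).shape)  # torch.Size([512])
```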

Video-Text Retrieval by Supervised Sparse Multi-Grained Learning

yimuwangcs/Better_Cross_Modal_Retrieval 19 Feb 2023

While recent progress in video-text retrieval has been advanced by the exploration of better representation learning, in this paper, we present a novel multi-grained sparse learning framework, S3MA, to learn an aligned sparse space shared between the video and the text for video-text retrieval.
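As a loose illustration of matching in a shared sparse space (a generic sketch under assumed design choices, not S3MA's actual multi-grained framework), both modalities can be projected to a high-dimensional non-negative code and compared by the overlap of active dimensions; real methods typically add regularization to encourage sparsity:

```python
import torch
import torch.nn as nn

sparse_dim = 4096
video_proj = nn.Linear(512, sparse_dim)       # illustrative projection heads
text_proj = nn.Linear(512, sparse_dim)

def sparse_similarity(video_emb, text_emb):
    v = torch.relu(video_proj(video_emb))     # non-negative codes; ReLU zeroes many dims
    t = torch.relu(text_proj(text_emb))
    return (v * t).sum()                      # overlap of jointly active dimensions

print(sparse_similarity(torch.randn(512), torch.randn(512)))
```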

CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning

FangyunWei/SLRT CVPR 2023

Our framework, termed as domain-aware sign language retrieval via Cross-lingual Contrastive learning or CiCo for short, outperforms the pioneering method by large margins on various datasets, e.g., +22.4 T2V and +28.0 V2T R@1 improvements on How2Sign dataset, and +13.7 T2V and +17.1 V2T R@1 improvements on PHOENIX-2014T dataset.

SViTT: Temporal Learning of Sparse Video-Text Transformers

jerryyli/svitt CVPR 2023

Do video-text transformers learn to model temporal relationships across frames?

Global and Local Semantic Completion Learning for Vision-Language Pre-training

iigroup/scl 12 Jun 2023

MGSC promotes learning more representative global features, which have a great impact on the performance of downstream tasks, while MLTC reconstructs modal-fusion local tokens, further enhancing accurate comprehension of multimodal data.

Helping Hands: An Object-Aware Ego-Centric Video Recognition Model

chuhanxx/helping_hand_for_egocentric_videos ICCV 2023

We demonstrate the performance of the object-aware representations learnt by our model, by: (i) evaluating it for strong transfer, i.e. through zero-shot testing, on a number of downstream video-text retrieval and classification benchmarks; and (ii) by using the representations learned as input for long-term video understanding tasks (e.g. Episodic Memory in Ego4D).

Multi-event Video-Text Retrieval

gengyuanmax/mevtr ICCV 2023

In this study, we introduce the Multi-event Video-Text Retrieval (MeVTR) task, addressing scenarios in which each video contains multiple different events, as a niche scenario of the conventional Video-Text Retrieval Task.

UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory

Paranioar/UniPT 28 Aug 2023

Parameter-efficient transfer learning (PETL), i.e., fine-tuning a small portion of parameters, is an effective strategy for adapting pre-trained models to downstream domains.
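A minimal sketch of the PETL idea (freeze the pre-trained backbone and train only a small added module; the adapter shape here is illustrative, not UniPT's architecture):

```python
import torch
import torch.nn as nn

# Frozen pre-trained backbone (stand-in for a real vision-language model).
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=6,
)
for p in backbone.parameters():
    p.requires_grad = False                   # backbone stays frozen

# Tiny trainable bottleneck adapter; only these parameters are updated.
adapter = nn.Sequential(
    nn.Linear(512, 64), nn.ReLU(), nn.Linear(64, 512),
)

trainable = sum(p.numel() for p in adapter.parameters())
total = trainable + sum(p.numel() for p in backbone.parameters())
print(f"trainable params: {trainable} / {total} ({100 * trainable / total:.2f}%)")
```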

Unified Coarse-to-Fine Alignment for Video-Text Retrieval

ziyang412/ucofia ICCV 2023

Specifically, our model captures the cross-modal similarity information at different granularity levels.
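As a hedged sketch of coarse-to-fine matching (the names and the fusion rule are assumptions for illustration, not UCoFiA's exact formulation), a coarse sentence-video similarity can be combined with a fine word-frame similarity:

```python
import torch
import torch.nn.functional as F

def coarse_to_fine_similarity(sent_emb, word_embs, video_emb, frame_embs, alpha=0.5):
    """Combine coarse (sentence-video) and fine (word-frame) similarities.

    sent_emb:  (d,)  sentence embedding      word_embs:  (num_words, d)
    video_emb: (d,)  pooled video embedding  frame_embs: (num_frames, d)
    """
    coarse = F.cosine_similarity(sent_emb, video_emb, dim=0)
    # Fine-grained: each word matched to its best frame, then averaged.
    fine_matrix = F.normalize(word_embs, dim=1) @ F.normalize(frame_embs, dim=1).T
    fine = fine_matrix.max(dim=1).values.mean()
    return alpha * coarse + (1 - alpha) * fine

s, w = torch.randn(512), torch.randn(12, 512)
v, f = torch.randn(512), torch.randn(8, 512)
print(coarse_to_fine_similarity(s, w, v, f))
```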

Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval

leolee99/pau NeurIPS 2023

In this paper, we propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework to provide trustworthy predictions by quantifying the uncertainty arisen from the inherent data ambiguity.
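As an assumed, simplified stand-in for prototype-based uncertainty (not PAU's actual formulation), one way to score data ambiguity is the entropy of a sample's affinity distribution over a set of learned prototypes:

```python
import torch
import torch.nn.functional as F

def prototype_uncertainty(embedding, prototypes):
    """Entropy of the similarity distribution over learned prototypes as a
    rough ambiguity score (illustrative only).

    embedding:  (d,)   normalized sample embedding.
    prototypes: (k, d) normalized prototype vectors.
    """
    probs = F.softmax(prototypes @ embedding, dim=0)   # affinity to each prototype
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    return entropy                                     # higher entropy => more ambiguous

protos = F.normalize(torch.randn(16, 512), dim=1)
x = F.normalize(torch.randn(512), dim=0)
print(prototype_uncertainty(x, protos))
```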