TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Retrieval	MSR-VTT-1kA	PIDRo	text-to-video Mean Rank	10.7	# 5
Video Retrieval	MSR-VTT-1kA	PIDRo	text-to-video R@1	55.9	# 3
Video Retrieval	MSR-VTT-1kA	PIDRo	text-to-video R@5	79.8	# 4
Video Retrieval	MSR-VTT-1kA	PIDRo	text-to-video R@10	87.6	# 5
Video Retrieval	MSR-VTT-1kA	PIDRo	text-to-video Median Rank	1.0	# 1
Video Retrieval	MSR-VTT-1kA	PIDRo	video-to-text R@1	54.5	# 5
Video Retrieval	MSR-VTT-1kA	PIDRo	video-to-text R@5	78.3	# 22
Video Retrieval	MSR-VTT-1kA	PIDRo	video-to-text R@10	87.3	# 4
Video Retrieval	MSR-VTT-1kA	PIDRo	video-to-text Median Rank	1.0	# 1
Video Retrieval	MSR-VTT-1kA	PIDRo	video-to-text Mean Rank	7.5	# 5
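
The metrics in the table above are rank-based: R@K is the percentage of queries whose correct match appears in the top K results (higher is better), while Median Rank and Mean Rank summarize the position of the correct match (lower is better). The sketch below shows one way such numbers could be computed from a similarity matrix. It is an illustration only, not the evaluation code used for PIDRo; the function names, the toy matrix, and the convention that the correct candidate for query `i` sits at index `i` are all assumptions.

```python
from statistics import mean, median


def ranks_from_similarity(sim):
    """Return the 1-based rank of the ground-truth candidate for each query.

    Assumption: sim[i][j] scores query i against candidate j, and the
    correct candidate for query i is at index i.
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank = 1 + number of candidates scoring strictly higher than the target.
        ranks.append(1 + sum(1 for s in row if s > target))
    return ranks


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K (in percent), Median Rank, and Mean Rank."""
    ranks = ranks_from_similarity(sim)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    metrics["Median Rank"] = float(median(ranks))
    metrics["Mean Rank"] = float(mean(ranks))
    return metrics


# Hypothetical 3x3 text-to-video similarity matrix (rows: texts, columns: videos).
toy = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.1, 0.7],
]
print(retrieval_metrics(toy, ks=(1, 2)))
```

Under the same assumptions, video-to-text metrics would come from the transposed matrix, for example `retrieval_metrics([list(col) for col in zip(*toy)])`.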

PIDRo: Parallel Isomeric Attention with Dynamic Routing for Text-Video Retrieval

ICCV 2023 · Peiyan Guan, Renjing Pei, Bin Shao, Jianzhuang Liu, Weimian Li, Jiaxi Gu, Hang Xu, Songcen Xu, Youliang Yan, Edmund Y. Lam

Text-video retrieval is a fundamental task with high practical value in multi-modal research. Inspired by the great success of pre-trained image-text models with large-scale data, such as CLIP, many methods are proposed to transfer the strong representation learning capability of CLIP to text-video retrieval. However, due to the modality difference between videos and images, how to effectively adapt CLIP to the video domain is still underexplored. In this paper, we investigate this problem from two aspects. First, we enhance the transferred image encoder of CLIP for fine-grained video understanding in a seamless fashion. Second, we conduct fine-grained contrast between videos and texts from both model improvement and loss design. Particularly, we propose a fine-grained contrastive model equipped with parallel isomeric attention and dynamic routing, namely PIDRo, for text-video retrieval. The parallel isomeric attention module is used as the video encoder, which consists of two parallel branches modeling the spatial-temporal information of videos from both patch and frame levels. The dynamic routing module is constructed to enhance the text encoder of CLIP, generating informative word representations by distributing the fine-grained information to the related word tokens within a sentence. Such model design provides us with informative patch, frame and word representations. We then conduct token-wise interaction upon them. With the enhanced encoders and the token-wise loss, we are able to achieve finer-grained text-video alignment and more accurate retrieval. PIDRo obtains state-of-the-art performance over various text-video retrieval benchmarks, including MSR-VTT, MSVD, LSMDC, DiDeMo and ActivityNet.
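
The abstract does not spell out how the token-wise interaction is computed. As a rough illustration of the general idea, the sketch below scores a text against a video with a common late-interaction scheme: each word token is matched to its most similar video token (patch or frame), each video token to its most similar word token, and the two averages are combined. This max-then-mean scheme, the function names, and the toy vectors are assumptions for illustration, not PIDRo's actual formulation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def token_wise_similarity(word_tokens, video_tokens):
    """Hypothetical token-wise score between one sentence and one video.

    word_tokens:  list of word embedding vectors
    video_tokens: list of patch- or frame-level embedding vectors
    """
    words = [l2_normalize(w) for w in word_tokens]
    vids = [l2_normalize(v) for v in video_tokens]
    # Cosine similarity between every word token and every video token.
    sim = [[sum(a * b for a, b in zip(w, v)) for v in vids] for w in words]
    # Text-to-video: best-matching video token per word, averaged over words.
    t2v = sum(max(row) for row in sim) / len(words)
    # Video-to-text: best-matching word per video token, averaged over video tokens.
    v2t = sum(max(col) for col in zip(*sim)) / len(vids)
    return 0.5 * (t2v + v2t)


# Toy 2-dimensional embeddings (made up for the example).
words = [[1.0, 0.0], [0.0, 1.0]]
video = [[1.0, 0.0], [0.0, 2.0]]
print(token_wise_similarity(words, video))
```

A score computed this way for every text-video pair would fill a similarity matrix that could then feed a contrastive loss during training or a ranking step at retrieval time.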

Code

No code implementations yet.

Tasks

Representation Learning

Retrieval

Sentence

Video Alignment

Video Retrieval

Video Understanding

Datasets

MSR-VTT

MSVD

DiDeMo

Results from the Paper

Ranked #3 on Video Retrieval on MSR-VTT-1kA

Methods

CLIP
