TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video Retrieval	ActivityNet	DMAE (ViT-B/32)	text-to-video R@1	53.4	# 13
Video Retrieval	ActivityNet	DMAE (ViT-B/32)	text-to-video R@5	80.7	# 10
Video Retrieval	ActivityNet	DMAE (ViT-B/32)	text-to-video R@10	89.2	# 10
Video Retrieval	ActivityNet	DMAE (ViT-B/32)	text-to-video Median Rank	1.0	# 1
Video Retrieval	ActivityNet	DMAE (ViT-B/32)	text-to-video Mean Rank	5.3	# 4
Video Retrieval	DiDeMo	DMAE (ViT-B/32)	text-to-video R@1	52.7	# 18
Video Retrieval	DiDeMo	DMAE (ViT-B/32)	text-to-video R@5	79.3	# 15
Video Retrieval	DiDeMo	DMAE (ViT-B/32)	text-to-video R@10	86.6	# 14
Video Retrieval	DiDeMo	DMAE (ViT-B/32)	text-to-video Median Rank	1.0	# 1
Video Retrieval	DiDeMo	DMAE (ViT-B/32)	text-to-video Mean Rank	10.5	# 1
Video Retrieval	MSR-VTT-1kA	DMAE (ViT-B/16)	text-to-video Mean Rank	10.0	# 4
Video Retrieval	MSR-VTT-1kA	DMAE (ViT-B/16)	text-to-video R@1	55.5	# 4
Video Retrieval	MSR-VTT-1kA	DMAE (ViT-B/16)	text-to-video R@5	79.4	# 6
Video Retrieval	MSR-VTT-1kA	DMAE (ViT-B/16)	text-to-video R@10	87.1	# 8
Video Retrieval	MSR-VTT-1kA	DMAE (ViT-B/16)	text-to-video Median Rank	1.0	# 1
Video Retrieval	MSR-VTT-1kA	DMAE (ViT-B/16)	video-to-text R@1	55.7	# 3
Video Retrieval	MSR-VTT-1kA	DMAE (ViT-B/16)	video-to-text R@5	79.2	# 4
Video Retrieval	MSR-VTT-1kA	DMAE (ViT-B/16)	video-to-text R@10	87.2	# 5
Video Retrieval	MSR-VTT-1kA	DMAE (ViT-B/16)	video-to-text Median Rank	1.0	# 1
Video Retrieval	MSR-VTT-1kA	DMAE (ViT-B/16)	video-to-text Mean Rank	7.3	# 4
Video Retrieval	MSVD	DMAE (ViT-B/32)	text-to-video R@1	48.7	# 13
Video Retrieval	MSVD	DMAE (ViT-B/32)	text-to-video R@5	78.4	# 11
Video Retrieval	MSVD	DMAE (ViT-B/32)	text-to-video R@10	86.3	# 10
Video Retrieval	MSVD	DMAE (ViT-B/32)	text-to-video Median Rank	2.0	# 8
Video Retrieval	MSVD	DMAE (ViT-B/32)	text-to-video Mean Rank	9.8	# 11
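
The table reports Recall@K (R@K, higher is better) together with Median Rank and Mean Rank (lower is better). As a rough illustration of how such numbers are usually derived from a query-by-candidate similarity matrix, here is a minimal sketch. It assumes one ground-truth candidate per query, stored at the same index, and breaks ties in the query's favour. The function names and the toy matrix are hypothetical and are not taken from the paper or its code.

```python
from statistics import mean, median


def retrieval_ranks(sim):
    """Return the 1-based rank of the ground-truth item for each query.

    sim[i][j] is the similarity between query i and candidate j. The
    matching candidate for query i is assumed to sit at index i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of candidates scored strictly higher than the match.
        ranks.append(1 + sum(1 for s in row if s > row[i]))
    return ranks


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K (in percent), Median Rank, and Mean Rank."""
    ranks = retrieval_ranks(sim)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    metrics["Median Rank"] = float(median(ranks))
    metrics["Mean Rank"] = float(mean(ranks))
    return metrics


# Toy 4x4 text-to-video similarity matrix (made-up numbers, not from the paper).
toy_sim = [
    [0.9, 0.1, 0.2, 0.3],  # query 0: match ranked 1st
    [0.4, 0.8, 0.1, 0.2],  # query 1: match ranked 1st
    [0.7, 0.6, 0.5, 0.1],  # query 2: match ranked 3rd
    [0.2, 0.9, 0.3, 0.8],  # query 3: match ranked 2nd
]
text_to_video = retrieval_metrics(toy_sim, ks=(1, 2))
# Transposing the matrix gives the video-to-text direction.
video_to_text = retrieval_metrics([list(col) for col in zip(*toy_sim)], ks=(1, 2))
```

Benchmarks in which a video has several captions need a different ground-truth mapping than the index-aligned one assumed here.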

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/dual-modal-attention-enhanced-text-video/video-retrieval-on-msr-vtt-1ka)](https://paperswithcode.com/sota/video-retrieval-on-msr-vtt-1ka?p=dual-modal-attention-enhanced-text-video)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/dual-modal-attention-enhanced-text-video/video-retrieval-on-activitynet)](https://paperswithcode.com/sota/video-retrieval-on-activitynet?p=dual-modal-attention-enhanced-text-video)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/dual-modal-attention-enhanced-text-video/video-retrieval-on-msvd)](https://paperswithcode.com/sota/video-retrieval-on-msvd?p=dual-modal-attention-enhanced-text-video)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/dual-modal-attention-enhanced-text-video/video-retrieval-on-didemo)](https://paperswithcode.com/sota/video-retrieval-on-didemo?p=dual-modal-attention-enhanced-text-video)`

Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning

20 Sep 2023 · Chen Jiang, Hong Liu, Xuzheng Yu, Qing Wang, Yuan Cheng, Jia Xu, Zhongyi Liu, Qingpei Guo, Wei Chu, Ming Yang, Yuan Qi

In recent years, the explosion of web videos has made text-video retrieval increasingly essential and popular for video filtering, recommendation, and search. Text-video retrieval aims to rank relevant texts/videos higher than irrelevant ones. The core of this task is to precisely measure the cross-modal similarity between texts and videos. Recently, contrastive learning methods have shown promising results for text-video retrieval, most of which focus on the construction of positive and negative pairs to learn text and video representations. Nevertheless, they do not pay enough attention to hard negative pairs and lack the ability to model different levels of semantic similarity. To address these two issues, this paper improves contrastive learning using two novel techniques. First, to exploit hard examples for robust discriminative power, we propose a novel Dual-Modal Attention-Enhanced Module (DMAE) to mine hard negative pairs from textual and visual clues. By further introducing a Negative-aware InfoNCE (NegNCE) loss, we are able to adaptively identify all these hard negatives and explicitly highlight their impact in the training loss. Second, our work argues that triplet samples can better model fine-grained semantic similarity than pairwise samples. We therefore present a new Triplet Partial Margin Contrastive Learning (TPM-CL) module to construct partial-order triplet samples by automatically generating fine-grained hard negatives for matched text-video pairs. The proposed TPM-CL designs an adaptive token masking strategy with cross-modal interaction to model subtle semantic differences. Extensive experiments demonstrate that the proposed approach outperforms existing methods on four widely used text-video retrieval datasets: MSR-VTT, MSVD, DiDeMo, and ActivityNet.
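
The abstract names a Negative-aware InfoNCE (NegNCE) loss but does not give its formula. To make the idea of "highlighting hard negatives in the training loss" concrete, the sketch below pairs a standard symmetric InfoNCE with an extra term that is active only for negatives scoring above the positive pair. The extra term, the function names (`info_nce`, `hard_negative_penalty`, `negative_aware_loss`), and the `temperature` and `weight` defaults are all assumptions made for illustration. They should not be read as the authors' definition.

```python
import math


def info_nce(sim, temperature=0.05):
    """Symmetric InfoNCE over a square similarity matrix.

    sim[i][j] is the similarity of text i and video j. Pairs (i, i) are
    positives and every other entry in the batch acts as a negative.
    """
    n = len(sim)

    def directional(matrix):
        total = 0.0
        for i, row in enumerate(matrix):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_denominator - logits[i]
        return total / n

    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (directional(sim) + directional(transposed))


def hard_negative_penalty(sim, temperature=0.05):
    """Assumed extra term: penalise only negatives that outscore the positive.

    Applied per text query (rows) only, to keep the sketch short.
    """
    n = len(sim)
    total = 0.0
    for i, row in enumerate(sim):
        excess = sum(
            math.exp((s - row[i]) / temperature)
            for j, s in enumerate(row)
            if j != i and s > row[i]  # a "hard" negative in this sketch
        )
        total += math.log1p(excess)
    return total / n


def negative_aware_loss(sim, temperature=0.05, weight=1.0):
    """InfoNCE plus the weighted hard-negative term (illustrative combination)."""
    return info_nce(sim, temperature) + weight * hard_negative_penalty(sim, temperature)


# Made-up 2x2 batches: in `hard`, video 1 outscores the true match for text 0.
easy = [[0.9, 0.1], [0.2, 0.8]]
hard = [[0.3, 0.7], [0.2, 0.8]]
easy_loss = negative_aware_loss(easy)
hard_loss = negative_aware_loss(hard)
```

In this sketch the extra term is exactly zero whenever every positive pair already scores highest in its row, so it only changes the loss for batches that contain hard negatives.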

Code

alipay/Ant-Multi-Modal-Framework official

Tasks

Contrastive Learning

Retrieval

Semantic Similarity

Semantic Textual Similarity

Video Retrieval

Datasets

ActivityNet

MSR-VTT

MSVD

DiDeMo

Results from the Paper

Ranked #4 on Video Retrieval on MSR-VTT-1kA

Methods

Contrastive Learning • Focus • InfoNCE
