Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss

9 Sep 2021 · Xing Cheng, Hezheng Lin, Xiangyu Wu, Fan Yang, Dong Shen

Employing the large-scale pre-trained model CLIP for the video-text retrieval (VTR) task has become a new trend that surpasses previous VTR methods. However, due to the heterogeneity of structure and content between video and text, previous CLIP-based models are prone to overfitting during training, resulting in relatively poor retrieval performance. In this paper, we propose a multi-stream Corpus Alignment network with a single-gate Mixture-of-Experts (CAMoE) and a novel Dual Softmax Loss (DSL) to address these two forms of heterogeneity. CAMoE employs a Mixture-of-Experts (MoE) to extract multi-perspective video representations, including action, entity, and scene, and then aligns them with the corresponding parts of the text. In this stage, we conduct extensive exploration of the feature extraction and feature alignment modules. DSL is proposed to avoid the one-way optimal match that occurs in previous contrastive methods. By introducing the intrinsic prior of each pair in a batch, DSL serves as a reviser that corrects the similarity matrix and achieves the dual optimal match. DSL can be implemented in a single line of code yet brings significant improvements. The results show that the proposed CAMoE and DSL are highly effective, and each of them individually achieves state-of-the-art (SOTA) performance on various benchmarks such as MSR-VTT, MSVD, and LSMDC. Furthermore, combining both advances performance substantially, surpassing previous SOTA methods by around 4.6% R@1 on MSR-VTT.
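The abstract describes DSL as a one-line revision of the batch similarity matrix. The sketch below illustrates that idea in PyTorch under one reading of the description: the similarity matrix is multiplied elementwise by a softmax prior computed along the opposite retrieval direction before the usual symmetric cross-entropy. The function name, the `logit_scale` value, and the exact placement of the temperature are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F


def dual_softmax_loss(sim, logit_scale=100.0):
    # sim: (B, B) cosine-similarity matrix for a batch,
    # where sim[i, j] = similarity(text_i, video_j) and matched pairs lie on the diagonal.
    b = sim.size(0)
    labels = torch.arange(b, device=sim.device)

    # The "one-line" dual-softmax revision: multiply the similarities by a prior
    # taken from the opposite retrieval direction before applying the usual softmax.
    t2v = sim * F.softmax(sim * logit_scale, dim=0)  # column-wise prior for text-to-video
    v2t = sim * F.softmax(sim * logit_scale, dim=1)  # row-wise prior for video-to-text

    # Standard symmetric cross-entropy on the revised similarities.
    loss_t2v = F.cross_entropy(logit_scale * t2v, labels)
    loss_v2t = F.cross_entropy(logit_scale * v2t.t(), labels)
    return 0.5 * (loss_t2v + loss_v2t)
```

At inference time the same revision can be applied before ranking, e.g. `revised = sim * F.softmax(sim * logit_scale, dim=0)` for text-to-video retrieval, which is the one-line form the abstract refers to.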


Results from the Paper


Ranked #9 on Video Retrieval on MSVD (using extra training data)

Task: Video Retrieval · Model: CAMoE
Metric Name                   Metric Value    Global Rank

ActivityNet
    text-to-video R@1             51.0        # 14
    text-to-video R@5             77.7        # 12
    text-to-video R@10            87.6        # 11
    text-to-video Median Rank     1           # 1
    text-to-video Mean Rank       6.3         # 6

DiDeMo
    text-to-video R@1             43.8        # 31
    text-to-video R@5             71.4        # 29
    text-to-video R@10            79.9        # 30
    text-to-video Median Rank     2.0         # 9
    text-to-video Mean Rank       16.3        # 11
    video-to-text R@1             45.5        # 14
    video-to-text R@10            80.5        # 11
    video-to-text Median Rank     2           # 5
    video-to-text Mean Rank       10.2        # 7

LSMDC
    text-to-video R@1             25.9        # 15
    text-to-video R@5             46.1        # 13
    text-to-video R@10            53.7        # 15
    text-to-video Mean Rank       54.4        # 7

MSR-VTT
    text-to-video R@1             32.9        # 21
    text-to-video R@5             58.3        # 19
    text-to-video R@10            68.4        # 19
    text-to-video Median Rank     3           # 1
    text-to-video Mean Rank       42.6        # 2
    video-to-text R@1             59.8        # 3
    video-to-text R@5             86.2        # 1
    video-to-text R@10            92.8        # 1
    video-to-text Median Rank     1           # 1
    video-to-text Mean Rank       3.8         # 1

MSR-VTT-1kA
    text-to-video R@1             48.8        # 22
    text-to-video R@5             75.6        # 16
    text-to-video R@10            85.3        # 11
    text-to-video Median Rank     2           # 10
    text-to-video Mean Rank       12.4        # 11
    video-to-text R@1             50.3        # 8
    video-to-text R@5             74.6        # 11
    video-to-text R@10            83.8        # 14
    video-to-text Median Rank     2           # 7
    video-to-text Mean Rank       9.9         # 16

MSVD
    text-to-video R@1             51.8        # 9
    text-to-video R@5             87.6        # 1
    text-to-video R@10            87.6        # 9
    text-to-video Median Rank     1           # 1
    text-to-video Mean Rank       8.9         # 8
    video-to-text R@1             69.3        # 6
    video-to-text R@5             90.6        # 6
    video-to-text R@10            94.6        # 7
    video-to-text Median Rank     1           # 1
    video-to-text Mean Rank       3.1         # 4

Methods