IcoCap: Improving Video Captioning by Compounding Images

Video captioning is more challenging than image captioning, primarily because of differences in content density. Video data contains redundant visual content, which makes it difficult for captioners to generalize across diverse content and avoid being misled by irrelevant elements. Moreover, this redundant content is not trimmed to match the visual semantics of the ground-truth captions, further increasing the difficulty of video captioning. Current research on video captioning focuses predominantly on captioner design and neglects the impact of content density on captioner performance. Given the differences between videos and images, another line of improvement is to leverage concise, easily learned image samples to further diversify video samples. This modification of content density compels the captioner to learn more effectively against redundancy and ambiguity. In this paper, we propose Image-Compounded learning for video Captioners (IcoCap) to facilitate better learning of complex video semantics. IcoCap comprises two components: the Image-Video Compounding Strategy (ICS) and Visual-Semantic Guided Captioning (VGC). ICS compounds easily learned image semantics into video semantics, diversifying video content and prompting the network to generalize across more varied samples. Moreover, when trained on samples compounded with image content, the captioner must extract valuable video cues in the presence of straightforward image semantics, which encourages it to focus on relevant information while filtering out extraneous content. VGC then guides the network in flexibly learning the ground-truth captions from the compounded samples, mitigating the mismatch between the ground truth and the ambiguous semantics of video samples. Our experiments demonstrate the effectiveness of IcoCap in improving the learning of video captioners. On the widely used MSVD, MSR-VTT, and VATEX datasets, our approach achieves results competitive with or superior to state-of-the-art methods, illustrating its capacity to handle redundant and ambiguous video data.
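
The abstract does not spell out the exact compounding operator used by ICS, so the sketch below is only a rough illustration of the general idea: injecting easily learned image semantics into a subset of video frame features during training. The function name `compound_image_into_video`, the `mix_ratio` parameter, and the 50/50 feature blend are hypothetical choices for this sketch, not the paper's implementation.

```python
import torch

def compound_image_into_video(video_feats, image_feats, mix_ratio=0.25):
    """Illustrative sketch of an image-video compounding step.

    video_feats: (T, D) per-frame features from a visual encoder
    image_feats: (N, D) features of easily learned image samples
    mix_ratio:   assumed fraction of frame positions to compound
    """
    T, _ = video_feats.shape
    n_mix = max(1, int(T * mix_ratio))
    # Randomly choose frame positions to compound with image semantics.
    positions = torch.randperm(T)[:n_mix]
    # Randomly pair each chosen position with one image feature.
    picks = image_feats[torch.randint(0, image_feats.shape[0], (n_mix,))]
    compounded = video_feats.clone()
    # Blend image semantics into the selected frame features, so the
    # captioner must separate relevant video cues from the added,
    # simpler image content.
    compounded[positions] = 0.5 * compounded[positions] + 0.5 * picks
    return compounded

# Usage: compound a 16-frame clip (512-d CLIP-like features) with 4 image features.
video = torch.randn(16, 512)
images = torch.randn(4, 512)
mixed = compound_image_into_video(video, images)
print(mixed.shape)  # torch.Size([16, 512])
```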


Results from the Paper


Ranked #5 on Video Captioning on VATEX (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Video Captioning | MSR-VTT | IcoCap (ViT-B/32) | CIDEr | 59.1 | #15 |
| Video Captioning | MSR-VTT | IcoCap (ViT-B/32) | METEOR | 30.3 | #11 |
| Video Captioning | MSR-VTT | IcoCap (ViT-B/32) | ROUGE-L | 64.3 | #11 |
| Video Captioning | MSR-VTT | IcoCap (ViT-B/32) | BLEU-4 | 46.1 | #14 |
| Video Captioning | MSR-VTT | IcoCap (ViT-B/16) | CIDEr | 60.2 | #13 |
| Video Captioning | MSR-VTT | IcoCap (ViT-B/16) | METEOR | 31.1 | #7 |
| Video Captioning | MSR-VTT | IcoCap (ViT-B/16) | ROUGE-L | 64.9 | #8 |
| Video Captioning | MSR-VTT | IcoCap (ViT-B/16) | BLEU-4 | 47.0 | #12 |
| Video Captioning | MSVD | IcoCap (ViT-B/16) | CIDEr | 110.3 | #12 |
| Video Captioning | MSVD | IcoCap (ViT-B/16) | BLEU-4 | 59.1 | #9 |
| Video Captioning | MSVD | IcoCap (ViT-B/16) | METEOR | 39.5 | #8 |
| Video Captioning | MSVD | IcoCap (ViT-B/16) | ROUGE-L | 76.5 | #8 |
| Video Captioning | MSVD | IcoCap (ViT-B/32) | CIDEr | 103.8 | #14 |
| Video Captioning | MSVD | IcoCap (ViT-B/32) | BLEU-4 | 56.3 | #10 |
| Video Captioning | MSVD | IcoCap (ViT-B/32) | METEOR | 38.9 | #10 |
| Video Captioning | MSVD | IcoCap (ViT-B/32) | ROUGE-L | 75.0 | #9 |
| Video Captioning | VATEX | IcoCap (ViT-B/16) | BLEU-4 | 37.4 | #5 |
| Video Captioning | VATEX | IcoCap (ViT-B/16) | CIDEr | 67.8 | #5 |
| Video Captioning | VATEX | IcoCap (ViT-B/16) | METEOR | 25.7 | #2 |
| Video Captioning | VATEX | IcoCap (ViT-B/16) | ROUGE-L | 53.1 | #3 |
| Video Captioning | VATEX | IcoCap (ViT-B/32) | BLEU-4 | 36.9 | #6 |
| Video Captioning | VATEX | IcoCap (ViT-B/32) | CIDEr | 63.4 | #8 |
| Video Captioning | VATEX | IcoCap (ViT-B/32) | METEOR | 24.6 | #5 |
| Video Captioning | VATEX | IcoCap (ViT-B/32) | ROUGE-L | 52.5 | #4 |
