TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Video Question Answer	MSRVTT-QA	SUM-shot+Vicuna	Accuracy	56.8	# 10
video narration captioning	Shot2Story20K	Ours	METEOR	24.8	# 1
video narration captioning	Shot2Story20K	Ours	ROUGE	39	# 1
video narration captioning	Shot2Story20K	Ours	BLEU-4	18.8	# 1
video narration captioning	Shot2Story20K	Ours	CIDEr	168.7	# 1
Video Captioning	Shot2Story20K	Ours	CIDEr	37.4	# 1
Video Captioning	Shot2Story20K	Ours	METEOR	16.2	# 1
Video Captioning	Shot2Story20K	Ours	ROUGE	29.6	# 1
Video Captioning	Shot2Story20K	Ours	BLEU-4	10.7	# 1
Video Summarization	Shot2Story20K	SUM-shot	CIDEr	8.6	# 1
Video Summarization	Shot2Story20K	SUM-shot	BLEU-4	11.7	# 1
Video Summarization	Shot2Story20K	SUM-shot	METEOR	19.7	# 1
Video Summarization	Shot2Story20K	SUM-shot	ROUGE	26.8	# 1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/shot2story20k-a-new-benchmark-for/video-narration-captioning-on-shot2story20k)](https://paperswithcode.com/sota/video-narration-captioning-on-shot2story20k?p=shot2story20k-a-new-benchmark-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/shot2story20k-a-new-benchmark-for/video-captioning-on-shot2story20k)](https://paperswithcode.com/sota/video-captioning-on-shot2story20k?p=shot2story20k-a-new-benchmark-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/shot2story20k-a-new-benchmark-for/video-summarization-on-shot2story20k)](https://paperswithcode.com/sota/video-summarization-on-shot2story20k?p=shot2story20k-a-new-benchmark-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/shot2story20k-a-new-benchmark-for/zeroshot-video-question-answer-on-msrvtt-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msrvtt-qa?p=shot2story20k-a-new-benchmark-for)`
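The four badge snippets above share one URL pattern: a paper slug plus a task slug, used once in the badge endpoint and once in the leaderboard link. The sketch below rebuilds them from that pattern. The pattern is inferred only from the four lines above, and the function name `pwc_badge_markdown` and the constants are hypothetical names chosen for illustration, not part of any official tooling.

```python
# Paper slug and task slugs copied from the badge snippets above.
PAPER_SLUG = "shot2story20k-a-new-benchmark-for"

TASK_SLUGS = [
    "video-narration-captioning-on-shot2story20k",
    "video-captioning-on-shot2story20k",
    "video-summarization-on-shot2story20k",
    "zeroshot-video-question-answer-on-msrvtt-qa",
]


def pwc_badge_markdown(paper_slug: str, task_slug: str) -> str:
    """Build one badge snippet (hypothetical helper; pattern inferred from the page)."""
    # Badge image: shields.io endpoint badge pointing at the per-task badge URL.
    badge_url = (
        "https://img.shields.io/endpoint.svg?url="
        f"https://paperswithcode.com/badge/{paper_slug}/{task_slug}"
    )
    # Link target: the task leaderboard, with the paper passed as the `p` parameter.
    sota_url = f"https://paperswithcode.com/sota/{task_slug}?p={paper_slug}"
    return f"[![PWC]({badge_url})]({sota_url})"


if __name__ == "__main__":
    for slug in TASK_SLUGS:
        print(pwc_badge_markdown(PAPER_SLUG, slug))
```

Each printed line can be pasted into a README as-is. Whether the pattern holds for other papers or tasks is an assumption and should be checked against the badge shown on the corresponding page.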

Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos

16 Dec 2023 · Mingfei Han, Linjie Yang, Xiaojun Chang, Heng Wang

A short video clip may contain a progression of multiple events and an interesting story line. A human needs to capture the event in every shot and associate the events with one another to understand the story behind the clip. In this work, we present Shot2Story20K, a new multi-shot video understanding benchmark with detailed shot-level captions and comprehensive video summaries. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks, including single-shot video and narration captioning, multi-shot video summarization, and video retrieval with shot descriptions. Preliminary experiments show that generating a long and comprehensive video summary remains challenging. Nevertheless, the generated summaries, although imperfect, can already significantly boost the performance of existing video understanding tasks such as video question answering, promoting an under-explored setting of video understanding with detailed summaries.

Code

bytedance/Shot2Story official

Tasks

Video Captioning

video narration captioning

Video Question Answering

Video Retrieval

Video Summarization

Video Understanding

Zero-Shot Video Question Answer

Datasets

Introduced in the Paper:

Shot2Story20K

Used in the Paper:

MSR-VTT

ActivityNet-QA

MSRVTT-QA

Results from the Paper

Ranked #1 on video narration captioning on Shot2Story20K

Methods

CLIP
