MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

28 Nov 2023 · Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, Yu Qiao

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in static image tasks, while overlooking temporal understanding in dynamic video tasks. To alleviate this issue, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a novel static-to-dynamic method to define these temporal tasks. By transforming various static tasks into dynamic ones, we enable the systematic generation of video tasks that require a broad spectrum of temporal skills, from perception to cognition. Then, guided by the task definitions, we automatically convert public video annotations into multiple-choice QA to evaluate each task. On the one hand, this paradigm allows us to build MVBench efficiently, with little manual intervention. On the other hand, it guarantees fair evaluation against ground-truth video annotations, avoiding the biased scoring of LLM judges. Moreover, we develop a robust video MLLM baseline, i.e., VideoChat2, by progressive multi-modal training with diverse instruction-tuning data. Extensive results on MVBench reveal that existing MLLMs remain far from satisfactory in temporal understanding, while our VideoChat2 surpasses these leading models by over 15% on MVBench. All models and data are available at https://github.com/OpenGVLab/Ask-Anything.
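The abstract describes converting public video annotations into multiple-choice QA and scoring answers objectively against ground truth rather than with an LLM judge. Below is a minimal Python sketch of that idea, assuming a hypothetical annotation format (a question, its ground-truth answer, and distractor answers drawn from the same task); the paper's actual conversion pipeline may differ.

import random

def to_multiple_choice(question, answer, distractors, seed=0):
    # Build one multiple-choice QA item from a ground-truth annotation.
    # `question`, `answer`, and `distractors` are hypothetical fields; the
    # real MVBench pipeline derives them from existing public video annotations.
    rng = random.Random(seed)
    options = [answer] + list(distractors)
    rng.shuffle(options)
    letters = "ABCD"
    prompt = question + "\n" + "\n".join(
        f"({letters[i]}) {opt}" for i, opt in enumerate(options)
    )
    return {"prompt": prompt, "answer": letters[options.index(answer)]}

def accuracy(predicted_letters, items):
    # Objective scoring against ground-truth option letters: no LLM judging.
    hits = sum(p == item["answer"] for p, item in zip(predicted_letters, items))
    return hits / len(items)

# Example usage with a made-up temporal question.
item = to_multiple_choice(
    "What does the person do after opening the door?",
    "walks into the room",
    ["closes the window", "sits down on the sofa", "picks up the phone"],
)
print(item["prompt"])                       # question with shuffled (A)-(D) options
print(accuracy([item["answer"]], [item]))   # 1.0 for a correct prediction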

Task | Dataset | Model | Metric | Value | Global Rank
Video Question Answering | ActivityNet-QA | VideoChat2 | Accuracy | 49.1 | #8
Video Question Answering | ActivityNet-QA | VideoChat2 | Confidence Score | 3.3 | #2
Zero-Shot Video Question Answer | ActivityNet-QA | VideoChat2 | Accuracy | 49.1 | #5
Zero-Shot Video Question Answer | ActivityNet-QA | VideoChat2 | Confidence Score | 3.3 | #5
Zero-Shot Video Question Answer | MSRVTT-QA | VideoChat2 | Accuracy | 54.1 | #13
Zero-Shot Video Question Answer | MSRVTT-QA | VideoChat2 | Confidence Score | 3.3 | #6
Zero-Shot Video Question Answer | MSVD-QA | VideoChat2 | Accuracy | 70.0 | #7
Zero-Shot Video Question Answer | MSVD-QA | VideoChat2 | Confidence Score | 3.9 | #3
Video Question Answering | MVBench | VideoChat2 | Avg. Accuracy | 51.9 | #3
Zero-Shot Video Question Answer | NExT-QA | VideoChat2 | Accuracy | 61.7 | #7
Video Question Answering | NExT-QA | VideoChat2 | Accuracy | 68.6 | #9
Zero-Shot Video Question Answer | STAR Benchmark | VideoChat2 | Accuracy | 59.0 | #1
Zero-Shot Learning | TVQA | VideoChat2 | Accuracy | 40.6 | #1
Zero-Shot Video Question Answer | TVQA | VideoChat2 | Accuracy | 40.6 | #3
Video-based Generative Performance Benchmarking (Consistency) | VideoInstruct | VideoChat2 | gpt-score | 2.81 | #2
Video-based Generative Performance Benchmarking (Temporal Understanding) | VideoInstruct | VideoChat2 | gpt-score | 2.66 | #3
Video-based Generative Performance Benchmarking (Contextual Understanding) | VideoInstruct | VideoChat2 | gpt-score | 3.51 | #3
Video-based Generative Performance Benchmarking (Detail Orientation) | VideoInstruct | VideoChat2 | gpt-score | 2.88 | #6
Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | VideoChat2 | gpt-score | 3.02 | #3
Video-based Generative Performance Benchmarking | VideoInstruct | VideoChat2 | Correctness of Information | 3.02 | #6
Video-based Generative Performance Benchmarking | VideoInstruct | VideoChat2 | Detail Orientation | 2.88 | #9
Video-based Generative Performance Benchmarking | VideoInstruct | VideoChat2 | Contextual Understanding | 3.51 | #6
Video-based Generative Performance Benchmarking | VideoInstruct | VideoChat2 | Temporal Understanding | 2.66 | #6
Video-based Generative Performance Benchmarking | VideoInstruct | VideoChat2 | Consistency | 2.81 | #5
Video-based Generative Performance Benchmarking | VideoInstruct | VideoChat2 | Mean | 2.98 | #8
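For the combined VideoInstruct rows, the reported mean is the average of the five GPT-evaluated dimensions: (3.02 + 2.88 + 3.51 + 2.66 + 2.81) / 5 = 2.976 ≈ 2.98.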
