MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

28 Nov 2023 · Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, Yu Qiao

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in static image tasks, while overlooking temporal understanding in dynamic video tasks. To alleviate this issue, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a novel static-to-dynamic method to define these temporal tasks. By transforming various static tasks into dynamic ones, we enable the systematic generation of video tasks that require a broad spectrum of temporal skills, from perception to cognition. Then, guided by the task definitions, we automatically convert public video annotations into multiple-choice QA to evaluate each task. On the one hand, this paradigm allows us to build MVBench efficiently, with little manual intervention. On the other hand, it guarantees fair evaluation against ground-truth video annotations, avoiding the biased scoring of LLM judges. Moreover, we develop a robust video MLLM baseline, i.e., VideoChat2, by progressive multi-modal training with diverse instruction-tuning data. Extensive results on MVBench reveal that existing MLLMs remain far from satisfactory in temporal understanding, while our VideoChat2 surpasses these leading models by over 15% on MVBench. All models and data are available at https://github.com/OpenGVLab/Ask-Anything.
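The abstract describes converting public video annotations into multiple-choice QA and scoring answers objectively against ground truth rather than with an LLM judge. Below is a minimal Python sketch of that idea, assuming a hypothetical annotation format (a question, its ground-truth answer, and distractor answers drawn from the same task); the paper's actual conversion pipeline may differ.

import random

def to_multiple_choice(question, answer, distractors, seed=0):
    # Build one multiple-choice QA item from a ground-truth annotation.
    # `question`, `answer`, and `distractors` are hypothetical fields; the
    # real MVBench pipeline derives them from existing public video annotations.
    rng = random.Random(seed)
    options = [answer] + list(distractors)
    rng.shuffle(options)
    letters = "ABCD"
    prompt = question + "\n" + "\n".join(
        f"({letters[i]}) {opt}" for i, opt in enumerate(options)
    )
    return {"prompt": prompt, "answer": letters[options.index(answer)]}

def accuracy(predicted_letters, items):
    # Objective scoring against ground-truth option letters: no LLM judging.
    hits = sum(p == item["answer"] for p, item in zip(predicted_letters, items))
    return hits / len(items)

# Example usage with a made-up temporal question.
item = to_multiple_choice(
    "What does the person do after opening the door?",
    "walks into the room",
    ["closes the window", "sits down on the sofa", "picks up the phone"],
)
print(item["prompt"])                       # question with shuffled (A)-(D) options
print(accuracy([item["answer"]], [item]))   # 1.0 for a correct prediction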

Task | Dataset | Model | Metric | Value | Global Rank
Video Question Answering | ActivityNet-QA | VideoChat2 | Accuracy | 49.1 | #8
Video Question Answering | ActivityNet-QA | VideoChat2 | Confidence Score | 3.3 | #2
Zero-Shot Video Question Answer | ActivityNet-QA | VideoChat2 | Accuracy | 49.1 | #5
Zero-Shot Video Question Answer | ActivityNet-QA | VideoChat2 | Confidence Score | 3.3 | #5
Zero-Shot Video Question Answer | MSRVTT-QA | VideoChat2 | Accuracy | 54.1 | #13
Zero-Shot Video Question Answer | MSRVTT-QA | VideoChat2 | Confidence Score | 3.3 | #6
Zero-Shot Video Question Answer | MSVD-QA | VideoChat2 | Accuracy | 70.0 | #7
Zero-Shot Video Question Answer | MSVD-QA | VideoChat2 | Confidence Score | 3.9 | #3
Video Question Answering | MVBench | VideoChat2 | Avg. Accuracy | 51.9 | #3
Zero-Shot Video Question Answer | NExT-QA | VideoChat2 | Accuracy | 61.7 | #7
Video Question Answering | NExT-QA | VideoChat2 | Accuracy | 68.6 | #9
Zero-Shot Video Question Answer | STAR Benchmark | VideoChat2 | Accuracy | 59.0 | #1
Zero-Shot Learning | TVQA | VideoChat2 | Accuracy | 40.6 | #1
Zero-Shot Video Question Answer | TVQA | VideoChat2 | Accuracy | 40.6 | #3
Video-based Generative Performance Benchmarking (Consistency) | VideoInstruct | VideoChat2 | gpt-score | 2.81 | #2
Video-based Generative Performance Benchmarking (Temporal Understanding) | VideoInstruct | VideoChat2 | gpt-score | 2.66 | #3
Video-based Generative Performance Benchmarking (Contextual Understanding) | VideoInstruct | VideoChat2 | gpt-score | 3.51 | #3
Video-based Generative Performance Benchmarking (Detail Orientation) | VideoInstruct | VideoChat2 | gpt-score | 2.88 | #6
Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | VideoChat2 | gpt-score | 3.02 | #3
Video-based Generative Performance Benchmarking | VideoInstruct | VideoChat2 | Correctness of Information | 3.02 | #6
Video-based Generative Performance Benchmarking | VideoInstruct | VideoChat2 | Detail Orientation | 2.88 | #9
Video-based Generative Performance Benchmarking | VideoInstruct | VideoChat2 | Contextual Understanding | 3.51 | #6
Video-based Generative Performance Benchmarking | VideoInstruct | VideoChat2 | Temporal Understanding | 2.66 | #6
Video-based Generative Performance Benchmarking | VideoInstruct | VideoChat2 | Consistency | 2.81 | #5
Video-based Generative Performance Benchmarking | VideoInstruct | VideoChat2 | Mean | 2.98 | #8
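For the combined VideoInstruct rows, the reported mean is the average of the five GPT-evaluated dimensions: (3.02 + 2.88 + 3.51 + 2.66 + 2.81) / 5 = 2.976 ≈ 2.98.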
