VideoChat: Chat-Centric Video Understanding

10 May 2023  ·  Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, LiMin Wang, Yu Qiao

In this paper, we make an initial attempt at developing an end-to-end chat-centric video understanding system, which we call VideoChat. It integrates video foundation models and large language models via a learnable neural interface, excelling at spatiotemporal reasoning, event localization, and causal relationship inference. To instruction-tune this system, we build a video-centric instruction dataset composed of thousands of videos paired with detailed descriptions and conversations. The dataset emphasizes spatiotemporal reasoning and causal relationships, making it a valuable asset for training our chat-centric video understanding system. Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications, and it could serve as a simple prototype for future research on chat-centric video understanding. Code and data are available at https://github.com/OpenGVLab/Ask-Anything

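Task | Dataset | Model | Metric Name | Metric Value | Global Rank
---- | ------- | ----- | ----------- | ------------ | -----------
Video Question Answering | ActivityNet-QA | Video Chat | Accuracy | 26.5 | # 31
Video Question Answering | ActivityNet-QA | Video Chat | Confidence Score | 2.2 | # 10
Zero-Shot Video Question Answer | ActivityNet-QA | Video Chat | Confidence Score | 2.2 | # 16
Zero-Shot Video Question Answer | ActivityNet-QA | Video Chat | Accuracy | 26.5 | # 16
Zero-Shot Video Question Answer | MSRVTT-QA | Video Chat-7B | Accuracy | 45.0 | # 18
Zero-Shot Video Question Answer | MSRVTT-QA | Video Chat-7B | Confidence Score | 2.5 | # 18
Zero-Shot Video Question Answer | MSVD-QA | Video Chat-7B | Accuracy | 56.3 | # 14
Zero-Shot Video Question Answer | MSVD-QA | Video Chat-7B | Confidence Score | 2.8 | # 15
Video Question Answering | MVBench | VideoChat | Avg. | 35.5 | # 7
Zero-Shot Video Question Answer | TGIF-QA | Video Chat-7B | Accuracy | 34.4 | # 8
Zero-Shot Video Question Answer | TGIF-QA | Video Chat-7B | Confidence Score | 2.3 | # 7
Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | Video Chat | gpt-score | 2.32 | # 9
Video-based Generative Performance Benchmarking | VideoInstruct | Video Chat | Correctness of Information | 2.23 | # 13
Video-based Generative Performance Benchmarking | VideoInstruct | Video Chat | Detail Orientation | 2.50 | # 13
Video-based Generative Performance Benchmarking | VideoInstruct | Video Chat | Contextual Understanding | 2.53 | # 14
Video-based Generative Performance Benchmarking | VideoInstruct | Video Chat | Temporal Understanding | 1.94 | # 15
Video-based Generative Performance Benchmarking | VideoInstruct | Video Chat | Consistency | 2.24 | # 13
Video-based Generative Performance Benchmarking | VideoInstruct | Video Chat | mean | 2.29 | # 14
Video-based Generative Performance Benchmarking (Detail Orientation) | VideoInstruct | Video Chat | gpt-score | 2.50 | # 9
Video-based Generative Performance Benchmarking (Temporal Understanding) | VideoInstruct | Video Chat | gpt-score | 1.94 | # 11
Video-based Generative Performance Benchmarking (Contextual Understanding) | VideoInstruct | Video Chat | gpt-score | 2.53 | # 10
Video-based Generative Performance Benchmarking (Consistency) | VideoInstruct | Video Chat | gpt-score | 2.24 | # 9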
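
The "mean" entry for VideoInstruct appears to be the unweighted average of the five per-dimension scores listed with it. The source does not state how the mean is computed, so the averaging rule below is an assumption; the variable names are illustrative only. A minimal check:

```python
# Per-dimension VideoInstruct scores for Video Chat, copied from the
# "Video-based Generative Performance Benchmarking" rows above.
scores = {
    "Correctness of Information": 2.23,
    "Detail Orientation": 2.50,
    "Contextual Understanding": 2.53,
    "Temporal Understanding": 1.94,
    "Consistency": 2.24,
}

# Assumption: the reported mean is a plain arithmetic average of the five scores.
mean_score = sum(scores.values()) / len(scores)

# Rounded to two decimals, this matches the reported mean of 2.29.
print(round(mean_score, 2))
```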
