TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Video Question Answer	ActivityNet-QA	MiniGPT4-video-7B	Accuracy	46.3	# 9
Zero-Shot Video Question Answer	MSRVTT-QA	MiniGPT4-video-7B	Accuracy	59.73	# 7
Zero-Shot Video Question Answer	MSVD-QA	MiniGPT4-video-7B	Accuracy	73.92	# 6
Zero-Shot Video Question Answer	TGIF-QA	MiniGPT4-video-7B	Accuracy	72.22	# 3
Zero-Shot Video Question Answer	TVQA	MiniGPT4-video-7B	Accuracy	54.21	# 3
Video-based Generative Performance Benchmarking (Correctness of Information)	VideoInstruct	MiniGPT4-video-7B	gpt-score	3.08	# 3
Video-based Generative Performance Benchmarking (Detail Orientation)	VideoInstruct	MiniGPT4-video-7B	gpt-score	3.02	# 4
Video-based Generative Performance Benchmarking (Contextual Understanding)	VideoInstruct	MiniGPT4-video-7B	gpt-score	3.57	# 3
Video-based Generative Performance Benchmarking (Temporal Understanding)	VideoInstruct	MiniGPT4-video-7B	gpt-score	2.65	# 4
Video-based Generative Performance Benchmarking (Consistency)	VideoInstruct	MiniGPT4-video-7B	gpt-score	2.67	# 5

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/minigpt4-video-advancing-multimodal-llms-for/zeroshot-video-question-answer-on-tgif-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-tgif-qa?p=minigpt4-video-advancing-multimodal-llms-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/minigpt4-video-advancing-multimodal-llms-for/zero-shot-video-question-answer-on-tvqa)](https://paperswithcode.com/sota/zero-shot-video-question-answer-on-tvqa?p=minigpt4-video-advancing-multimodal-llms-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/minigpt4-video-advancing-multimodal-llms-for/video-based-generative-performance-1)](https://paperswithcode.com/sota/video-based-generative-performance-1?p=minigpt4-video-advancing-multimodal-llms-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/minigpt4-video-advancing-multimodal-llms-for/video-based-generative-performance-3)](https://paperswithcode.com/sota/video-based-generative-performance-3?p=minigpt4-video-advancing-multimodal-llms-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/minigpt4-video-advancing-multimodal-llms-for/video-based-generative-performance-4)](https://paperswithcode.com/sota/video-based-generative-performance-4?p=minigpt4-video-advancing-multimodal-llms-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/minigpt4-video-advancing-multimodal-llms-for/video-based-generative-performance-5)](https://paperswithcode.com/sota/video-based-generative-performance-5?p=minigpt4-video-advancing-multimodal-llms-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/minigpt4-video-advancing-multimodal-llms-for/video-based-generative-performance-2)](https://paperswithcode.com/sota/video-based-generative-performance-2?p=minigpt4-video-advancing-multimodal-llms-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/minigpt4-video-advancing-multimodal-llms-for/zeroshot-video-question-answer-on-msvd-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msvd-qa?p=minigpt4-video-advancing-multimodal-llms-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/minigpt4-video-advancing-multimodal-llms-for/zeroshot-video-question-answer-on-msrvtt-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msrvtt-qa?p=minigpt4-video-advancing-multimodal-llms-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/minigpt4-video-advancing-multimodal-llms-for/zeroshot-video-question-answer-on-activitynet)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-activitynet?p=minigpt4-video-advancing-multimodal-llms-for)`

MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens

4 Apr 2024 · Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, Mohamed Elhoseiny

This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM) designed specifically for video understanding. The model processes both temporal visual data and textual data, which makes it adept at understanding the complexities of videos. MiniGPT-v2 excelled at translating visual features into the LLM space for single images and achieved impressive results on various image-text benchmarks; this paper extends that model to process a sequence of frames, enabling it to comprehend videos. MiniGPT4-Video considers not only visual content but also textual conversations, allowing the model to answer queries that involve both visual and text components. The proposed model outperforms existing state-of-the-art methods, registering gains of 4.22%, 1.13%, 20.82%, and 13.1% on the MSVD, MSRVTT, TGIF, and TVQA benchmarks, respectively. Our models and code have been made publicly available at https://vision-cair.github.io/MiniGPT4-video/
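The abstract and title describe feeding the LLM a sequence of frames together with text as interleaved visual-textual tokens. The sketch below illustrates one way such an interleaved sequence could be assembled. It is not the authors' implementation: the function name `interleave_tokens`, the placeholder token strings, and the per-frame ordering (visual tokens first, then that frame's text, with the question at the end) are all assumptions made for illustration.

```python
from typing import List, Sequence


def interleave_tokens(
    frame_tokens: Sequence[Sequence[str]],
    text_tokens: Sequence[Sequence[str]],
    question_tokens: Sequence[str],
) -> List[str]:
    """Flatten per-frame visual and text tokens into one interleaved sequence.

    Hypothetical layout (an assumption, not taken from the paper):
    [frame 0 visual] [frame 0 text] [frame 1 visual] [frame 1 text] ... [question]
    """
    if len(frame_tokens) != len(text_tokens):
        raise ValueError("expected one (possibly empty) text token list per frame")

    sequence: List[str] = []
    for visual, text in zip(frame_tokens, text_tokens):
        sequence.extend(visual)  # visual tokens for this frame
        sequence.extend(text)    # text aligned with this frame, may be empty
    sequence.extend(question_tokens)  # the user query closes the sequence
    return sequence


if __name__ == "__main__":
    # Placeholder strings stand in for real visual embeddings and text token ids.
    frames = [["<img_0_0>", "<img_0_1>"], ["<img_1_0>", "<img_1_1>"]]
    texts = [["a", "dog"], []]
    question = ["What", "happens", "?"]
    print(interleave_tokens(frames, texts, question))
```

In a real model the visual entries would be projected feature vectors rather than strings, and the context length of the LLM would bound how many frames fit in the sequence.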

Code

Vision-CAIR/MiniGPT4-video

371

Tasks

Language Modelling

Large Language Model

Video-based Generative Performance Benchmarking (Consistency)

Video-based Generative Performance Benchmarking (Contextual Understanding)

Video-based Generative Performance Benchmarking (Correctness of Information)

Video-based Generative Performance Benchmarking (Detail Orientation)

Video-based Generative Performance Benchmarking (Temporal Understanding)

Video Question Answering

Video Understanding

Zeroshot Video Question Answer

Zero-Shot Video Question Answer

Datasets

ActivityNet

TVQA

ActivityNet-QA

TGIF-QA

MSRVTT-QA

MSVD-QA

VideoInstruct

CMD

Results from the Paper

Ranked #3 on Zero-Shot Video Question Answer on TVQA
