mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

27 Apr 2023 · Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou

Large language models (LLMs) have demonstrated impressive zero-shot abilities on a variety of open-ended tasks, while recent research has also explored the use of LLMs for multi-modal generation. In this study, we introduce mPLUG-Owl, a novel training paradigm that equips LLMs with multi-modal abilities through modularized learning of a foundation LLM, a visual knowledge module, and a visual abstractor module. This approach can support multiple modalities and facilitate diverse unimodal and multimodal abilities through modality collaboration. The training paradigm of mPLUG-Owl involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of the LLM while maintaining and even improving the generation abilities of the LLM. In the first stage, the visual knowledge module and abstractor module are trained with a frozen LLM module to align the image and text. In the second stage, language-only and multi-modal supervised datasets are used to jointly fine-tune a low-rank adaptation (LoRA) module on the LLM and the abstractor module while the visual knowledge module is frozen. We carefully build a visually related instruction evaluation set, OwlEval. Experimental results show that our model outperforms existing multi-modal models, demonstrating mPLUG-Owl's impressive instruction and visual understanding ability, multi-turn conversation ability, and knowledge reasoning ability. Besides, we observe some unexpected and exciting abilities such as multi-image correlation and scene text understanding, which make it possible to apply the model to harder real-world scenarios, such as vision-only document comprehension. Our code, pre-trained model, instruction-tuned models, and evaluation set are available at https://github.com/X-PLUG/mPLUG-Owl. The online demo is available at https://www.modelscope.cn/studios/damo/mPLUG-Owl.
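
The two-stage schedule described in the abstract comes down to which modules receive gradient updates in each stage. The following is a minimal sketch of that freeze/train plan, not the authors' implementation: the module names (`visual_knowledge`, `visual_abstractor`, `llm`, `llm_lora`), the `StagePlan` class, and the `requires_grad` helper are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass

# Hypothetical component names, chosen for illustration only.
# "llm" stands for the base LLM weights; "llm_lora" for the LoRA adapter on it.
MODULES = ("visual_knowledge", "visual_abstractor", "llm", "llm_lora")


@dataclass(frozen=True)
class StagePlan:
    """Which modules are updated during one training stage."""
    name: str
    trainable: frozenset

    def frozen(self) -> frozenset:
        # Every module that is not trainable in this stage is kept fixed.
        return frozenset(MODULES) - self.trainable


# Stage 1 (per the abstract): train the visual knowledge module and the
# abstractor while the LLM stays frozen.
STAGE_1 = StagePlan(
    name="stage 1: image-text alignment",
    trainable=frozenset({"visual_knowledge", "visual_abstractor"}),
)

# Stage 2 (per the abstract): jointly fine-tune a LoRA module on the LLM and
# the abstractor while the visual knowledge module is frozen.
STAGE_2 = StagePlan(
    name="stage 2: joint instruction tuning",
    trainable=frozenset({"visual_abstractor", "llm_lora"}),
)


def requires_grad(stage: StagePlan, module: str) -> bool:
    """Return True if `module` should receive gradient updates in `stage`."""
    if module not in MODULES:
        raise KeyError(f"unknown module: {module}")
    return module in stage.trainable


if __name__ == "__main__":
    for stage in (STAGE_1, STAGE_2):
        print(stage.name)
        for module in MODULES:
            state = "train" if requires_grad(stage, module) else "frozen"
            print(f"  {module}: {state}")
```

In a real training loop, a plan like this would typically be translated into per-parameter freeze flags in whatever framework is used; the base LLM weights stay fixed in both stages, and only the adapter on top of them is updated in the second.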

Code

x-plug/mplug-owl official

1,941 stars

Tasks

Visual Question Answering (VQA)

Zero-Shot Video Question Answer

Datasets

Conceptual Captions

EgoSchema

HallusionBench

Results from the Paper

Ranked #3 on Visual Question Answering (VQA) on HallusionBench

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Zero-Shot Video Question Answer	EgoSchema (fullset)	mPLUG-Owl (7B)	Accuracy	31.1	# 8
Visual Question Answering (VQA)	HallusionBench	mPLUG-Owl	Question Pair Acc	2.36	# 3

Methods

ALIGN
