TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
visual instruction following	LLaVA-Bench	ShareGPT4V-13B	avg score	79.9	# 1
visual instruction following	LLaVA-Bench	ShareGPT4V-7B	avg score	72.6	# 2
Visual Question Answering	MM-Vet	ShareGPT4V-13B	GPT-4 score	43.1	# 28
Visual Question Answering	MM-Vet	ShareGPT4V-13B	Params	13B	# 1
Visual Question Answering	MM-Vet	ShareGPT4V-7B	GPT-4 score	37.6	# 44
Visual Question Answering	MM-Vet	ShareGPT4V-7B	Params	7B	# 1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/sharegpt4v-improving-large-multi-modal-models/visual-instruction-following-on-llava-bench)](https://paperswithcode.com/sota/visual-instruction-following-on-llava-bench?p=sharegpt4v-improving-large-multi-modal-models)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/sharegpt4v-improving-large-multi-modal-models/visual-question-answering-on-mm-vet)](https://paperswithcode.com/sota/visual-question-answering-on-mm-vet?p=sharegpt4v-improving-large-multi-modal-models)`

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

21 Nov 2023 · Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, Dahua Lin

In the realm of large multi-modal models (LMMs), efficient modality alignment is crucial yet often constrained by the scarcity of high-quality image-text data. To address this bottleneck, we introduce the ShareGPT4V dataset, a pioneering large-scale resource featuring 1.2 million highly descriptive captions, which surpasses existing datasets in diversity and information content, covering world knowledge, object properties, spatial relationships, and aesthetic evaluations. Specifically, ShareGPT4V originates from a curated 100K high-quality captions collected from advanced GPT4-Vision and has been expanded to 1.2M with a superb caption model trained on this subset. ShareGPT4V first demonstrates its effectiveness for the Supervised Fine-Tuning (SFT) phase, by substituting an equivalent quantity of detailed captions in existing SFT datasets with a subset of our high-quality captions, significantly enhancing the LMMs like LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B on the MME and MMBench benchmarks, with respective gains of 222.8/22.0/22.3 and 2.7/1.3/1.5. We further incorporate ShareGPT4V data into both the pre-training and SFT phases, obtaining ShareGPT4V-7B, a superior LMM based on a simple architecture that has remarkable performance across a majority of the multi-modal benchmarks. This project is available at https://ShareGPT4V.github.io to serve as a pivotal resource for advancing the LMMs community.
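The equal-quantity substitution described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `substitute_captions`, the sample keys (`image`, `text`, `kind`), and the random selection of replacement captions are all assumptions, since the abstract does not say how the replacement subset is chosen.

```python
import random


def substitute_captions(sft_samples, new_captions, seed=0):
    """Swap detailed-caption samples for an equal number of new captions.

    sft_samples:  list of dicts with hypothetical keys "image", "text", "kind".
    new_captions: dict mapping an image id to a higher-quality caption.
    The total number of samples in the SFT mix is left unchanged.
    """
    rng = random.Random(seed)
    # Only samples marked as detailed captions are candidates for replacement.
    caption_idx = [
        i for i, s in enumerate(sft_samples) if s["kind"] == "detailed_caption"
    ]
    pool = list(new_captions.items())
    # Draw as many new captions as there are samples to replace
    # (assumption: uniform random choice from the new caption pool).
    k = min(len(caption_idx), len(pool))
    replacements = rng.sample(pool, k)
    # Copy the samples so the input list is not modified.
    out = [dict(s) for s in sft_samples]
    for i, (image, text) in zip(caption_idx, replacements):
        out[i] = {"image": image, "text": text, "kind": "detailed_caption"}
    return out


if __name__ == "__main__":
    samples = [
        {"image": "a", "text": "old a", "kind": "detailed_caption"},
        {"image": "b", "text": "what is shown?", "kind": "vqa"},
    ]
    print(len(substitute_captions(samples, {"x": "new x", "y": "new y"})))
```

Keeping the sample count fixed is what makes the before/after comparison a test of caption quality rather than of data volume.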

Code

InternLM/InternLM-XComposer official

1,654

Tasks

Descriptive

visual instruction following

Visual Question Answering

World Knowledge

Datasets

Introduced in the Paper:

ShareGPT4V

Used in the Paper:

MS COCO

MM-Vet

LLaVA-Bench

Results from the Paper

Ranked #1 on visual instruction following on LLaVA-Bench

Methods

SFT
