GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

7 Jul 2023  ·  Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Yu Liu, Kai Chen, Ping Luo

Visual instruction tuning of large language models (LLMs) on image-text pairs has achieved general-purpose vision-language abilities. However, the lack of region-text pairs limits their advancement toward fine-grained multimodal understanding. In this paper, we propose spatial instruction tuning, which introduces references to regions of interest (RoIs) in the instruction. Before being sent to the LLM, each reference is replaced by RoI features and interleaved with language embeddings to form a sequence. Our model GPT4RoI, trained on 7 region-text pair datasets, brings an unprecedented interactive and conversational experience compared to previous image-level models. (1) Interaction beyond language: Users can interact with our model both by language and by drawing bounding boxes, flexibly adjusting the referring granularity. (2) Versatile multimodal abilities: A variety of attribute information within each RoI can be mined by GPT4RoI, e.g., color, shape, material, and action. Furthermore, it can reason about multiple RoIs based on common sense. On the Visual Commonsense Reasoning (VCR) dataset, GPT4RoI achieves a remarkable accuracy of 81.6%, surpassing all existing models by a significant margin (the second-best is 75.6%) and almost reaching human-level performance of 85.0%. The code, dataset, and demo can be found at https://github.com/jshilong/GPT4RoI.
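The abstract describes replacing region references in the instruction with RoI features before the sequence reaches the LLM. The sketch below illustrates one way this interleaving could work; it is not the authors' code. The placeholder token id, feature extractor, dimensions, and the `build_spatial_input` helper are illustrative assumptions.

```python
# Hedged sketch: splicing RoI feature embeddings into the language embedding
# sequence at placeholder positions (e.g., "<region1>") before the LLM.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

hidden_dim = 4096            # assumed LLM embedding width
vocab_size = 32000           # assumed tokenizer vocabulary size
REGION_TOKEN_ID = 32001      # hypothetical id reserved for region placeholders

token_embedding = nn.Embedding(vocab_size + 2, hidden_dim)
roi_projector = nn.Linear(256, hidden_dim)   # map RoI features to LLM embedding space

def build_spatial_input(input_ids, image_feats, boxes):
    """input_ids: (seq_len,) token ids containing REGION_TOKEN_ID placeholders.
    image_feats: (1, 256, H, W) feature map from a vision backbone.
    boxes: (num_regions, 4) user-drawn boxes in feature-map coordinates,
           ordered to match the placeholders in the instruction."""
    # Pool one feature vector per referred region (RoIAlign -> average pool).
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)  # prepend batch index
    region_feats = roi_align(image_feats, rois, output_size=(7, 7))
    region_feats = region_feats.mean(dim=(2, 3))                  # (num_regions, 256)
    region_embeds = roi_projector(region_feats)                   # (num_regions, hidden_dim)

    # Replace each placeholder's embedding with its RoI feature embedding,
    # leaving ordinary language-token embeddings untouched.
    embeds = token_embedding(input_ids).clone()                   # (seq_len, hidden_dim)
    region_positions = (input_ids == REGION_TOKEN_ID).nonzero(as_tuple=True)[0]
    embeds[region_positions] = region_embeds
    return embeds   # passed to the LLM as inputs_embeds
```

The resulting sequence can then be fed to the LLM through its embedding-level input interface, so language tokens and region features are processed jointly by the same decoder.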

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Visual Question Answering (VQA) | VCR (Q-AR) test | GPT4RoI | Accuracy | 81.6 | #1 |
| Visual Question Answering (VQA) | VCR (QA-R) test | GPT4RoI | Accuracy | 91.0 | #1 |
| Visual Question Answering (VQA) | VCR (Q-A) test | GPT4RoI | Accuracy | 89.4 | #1 |
| Visual Question Answering | ViP-Bench | GPT4RoI-7B (RoI) | GPT-4 score (bbox) | 35.1 | #9 |
