TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Visual Question Answering (VQA)	HallusionBench	LRV-Instruct	Question Pair Acc	1.57	# 4
Visual Question Answering	MM-Vet	LRV-Instruction-7B	GPT-4 score	31.7±0.1	# 73
Visual Question Answering	MM-Vet	LRV-Instruction-7B	Params	7B	# 1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/aligning-large-multi-modal-model-with-robust/visual-question-answering-vqa-on-3)](https://paperswithcode.com/sota/visual-question-answering-vqa-on-3?p=aligning-large-multi-modal-model-with-robust)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/aligning-large-multi-modal-model-with-robust/visual-question-answering-on-mm-vet)](https://paperswithcode.com/sota/visual-question-answering-on-mm-vet?p=aligning-large-multi-modal-model-with-robust)`

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

26 Jun 2023 · Fuxiao Liu, Kevin Lin, Linjie Li, JianFeng Wang, Yaser Yacoob, Lijuan Wang

Despite the promising progress in multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucinating inconsistent descriptions with respect to the associated image and human instructions. This paper addresses this issue by introducing the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction. Our dataset comprises 400k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. Unlike existing studies that primarily focus on positive instruction samples, we design LRV-Instruction to include both positive and negative instructions for more robust visual instruction tuning. Our negative instructions are designed at three semantic levels: (i) Nonexistent Object Manipulation, (ii) Existent Object Manipulation and (iii) Knowledge Manipulation. To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a stable approach to evaluate visual instruction tuning like human experts. GAVIE does not require human-annotated groundtruth answers and can adapt to diverse instruction formats. We conduct comprehensive experiments to investigate the hallucination of LMMs. Our results demonstrate existing LMMs exhibit significant hallucinations when presented with our negative instructions, particularly Existent Object and Knowledge Manipulation instructions. Moreover, we successfully mitigate hallucination by finetuning MiniGPT4 and mPLUG-Owl on LRV-Instruction while improving performance on several public datasets compared to state-of-the-art methods. Additionally, we observed that a balanced ratio of positive and negative instances in the training data leads to a more robust model. Code and data are available at https://github.com/FuxiaoLiu/LRV-Instruction.
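The mix of positive and negative instructions described in the abstract can be illustrated with a small sketch. This is not the authors' code: the `Instruction` record layout, the `balanced_mix` helper, and the default 50/50 ratio are assumptions made for illustration. Only the three negative levels are taken from the abstract.

```python
import random
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class NegativeLevel(Enum):
    # The three semantic levels named in the abstract.
    NONEXISTENT_OBJECT = "Nonexistent Object Manipulation"
    EXISTENT_OBJECT = "Existent Object Manipulation"
    KNOWLEDGE = "Knowledge Manipulation"


@dataclass
class Instruction:
    # Hypothetical record layout; the released data may use different fields.
    image_id: str
    question: str
    answer: str
    negative_level: Optional[NegativeLevel] = None  # None marks a positive sample

    @property
    def is_negative(self) -> bool:
        return self.negative_level is not None


def balanced_mix(samples: List[Instruction],
                 negative_ratio: float = 0.5,
                 seed: int = 0) -> List[Instruction]:
    """Subsample so that roughly `negative_ratio` of the result is negative.

    The 0.5 default is an assumption standing in for "balanced".
    """
    if not 0.0 < negative_ratio < 1.0:
        raise ValueError("negative_ratio must be strictly between 0 and 1")
    rng = random.Random(seed)
    pos = [s for s in samples if not s.is_negative]
    neg = [s for s in samples if s.is_negative]
    # Largest total size that both pools can support at the requested ratio.
    total = int(min(len(neg) / negative_ratio, len(pos) / (1.0 - negative_ratio)))
    n_neg = min(len(neg), round(total * negative_ratio))
    n_pos = min(len(pos), total - n_neg)
    mixed = rng.sample(neg, n_neg) + rng.sample(pos, n_pos)
    rng.shuffle(mixed)
    return mixed
```

Given a pool of eight positive and two negative samples, `balanced_mix(pool)` would return four samples, two of each kind. The smaller pool limits the size of the mix, so nothing is duplicated to reach the ratio.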

Code

FuxiaoLiu/LRV-Instruction official

213

h-zhao1997/cobra

182

FuxiaoLiu/VisualNews-Repository

fuxiaoliu/mmc

Tasks

Hallucination

Visual Question Answering

Visual Question Answering (VQA)

Datasets

Visual Genome

GQA

MM-Vet

HallusionBench

Results from the Paper

Ranked #4 on Visual Question Answering (VQA) on HallusionBench

Methods

Focus
