PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering

17 May 2023 · Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, Weidi Xie

In this paper, we focus on the problem of Medical Visual Question Answering (MedVQA), which is crucial for efficiently interpreting medical images that carry vital, clinically relevant information. Firstly, we reframe MedVQA as a generation task that naturally follows human-machine interaction, and propose a generative model for medical visual understanding that aligns visual information from a pre-trained vision encoder with a large language model. Secondly, we establish a scalable pipeline to construct a large-scale medical visual question-answering dataset, named PMC-VQA, which contains 227k VQA pairs over 149k images covering a wide variety of modalities and diseases. Thirdly, we pre-train our proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD and SLAKE, outperforming existing work by a large margin. Additionally, we propose a manually verified test set that is significantly more challenging, on which even the best models struggle.
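To make the architecture concrete, below is a minimal PyTorch sketch of the general pattern the abstract describes: visual features from a frozen, pre-trained vision encoder are projected into the language model's embedding space and prepended as a prefix to the question tokens, so the LLM can generate the answer. All class names, dimensions (e.g., `vision_dim=768`, `llm_dim=4096`), and the dummy stand-in modules are illustrative assumptions, not the paper's actual MedVInT implementation.

```python
import torch
import torch.nn as nn

class VisualPrefixVQA(nn.Module):
    """Sketch: align a pre-trained vision encoder with an LLM for
    generative VQA via a learned projection (assumed design)."""

    def __init__(self, vision_encoder, language_model,
                 vision_dim=768, llm_dim=4096, num_visual_tokens=32):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g., a ViT, kept frozen
        self.language_model = language_model   # decoder-only LLM head here
        # projection that maps visual features into the LLM embedding space
        self.proj = nn.Linear(vision_dim, llm_dim)
        self.num_visual_tokens = num_visual_tokens

    def forward(self, images, question_embeds):
        # images: (B, 3, H, W); question_embeds: (B, T, llm_dim)
        with torch.no_grad():                  # keep the encoder frozen
            feats = self.vision_encoder(images)          # (B, N, vision_dim)
        prefix = self.proj(feats[:, :self.num_visual_tokens])  # (B, K, llm_dim)
        # visual prefix + question embeddings are decoded jointly
        inputs = torch.cat([prefix, question_embeds], dim=1)
        return self.language_model(inputs)     # (B, K + T, vocab_size)

class DummyViT(nn.Module):
    """Stand-in encoder emitting fake patch features for shape-checking."""
    def forward(self, x):
        return torch.randn(x.size(0), 49, 768)  # (B, 49 patches, 768)

# Toy usage: a linear layer stands in for the LLM's output head.
model = VisualPrefixVQA(DummyViT(), nn.Linear(4096, 32000))
images = torch.randn(2, 3, 224, 224)
question_embeds = torch.randn(2, 16, 4096)
logits = model(images, question_embeds)  # (2, 32 + 16, 32000)
```

In this setup only the projection (and optionally the LLM) receives gradients, which is one common way to adapt a frozen vision backbone to a language model; the paper's exact training recipe may differ.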


Datasets


Introduced in the Paper:

PMC-VQA

Used in the Paper:

VQA-RAD, PMC-OA
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Medical Visual Question Answering | PMC-VQA | MedVInT | Accuracy | 42.3 | #1 |
| Generative Visual Question Answering | PMC-VQA | MedVInT | BLEU-1 | 23.2 | #1 |
| Visual Question Answering (VQA) | PMC-VQA | MedVInT | Accuracy | 42.3 | #1 |
| Medical Visual Question Answering | VQA-RAD | MedVInT | Close-ended Accuracy | 86.8 | #2 |
| Medical Visual Question Answering | VQA-RAD | MedVInT | Open-ended Accuracy | 73.7 | #1 |
| Medical Visual Question Answering | VQA-RAD | MedVInT | Overall Accuracy | 81.6 | #2 |
