Visual Question Answering (VQA)

760 papers with code • 62 benchmarks • 112 datasets

Visual Question Answering (VQA) is a task in computer vision that involves answering questions about an image. The goal of VQA is to teach machines to understand the content of an image and answer questions about it in natural language.

Image Source: visualqa.org

Benchmarks

Add a Result

These leaderboards are used to track progress in Visual Question Answering (VQA)

Dataset	Best Model	Compare
VQA v2 test-dev	PaLI	See all
VQA v2 test-std	BEiT-3	See all
OK-VQA	PaLI-X-VPD	See all
MSVD-QA	VLAB	See all
DocVQA test	Human	See all
MSRVTT-QA	VLAB	See all
InfographicVQA	Gemini Ultra (pixel only)	See all
COCO Visual Question Answering (VQA) real images 1.0 open ended	SAN	See all
CLEVR	NS-VQA (1K programs)	See all
GQA test-dev	CFR	See all
InfiMM-Eval	GPT-4V	See all
A-OKVQA	SMoLA-PaLI-X Specialist Model	See all
IconQA	ViLT	See all
VCR (Q-A) test	GPT4RoI	See all
VQA v2 val	BLIP-2 ViT-G FlanT5 XXL (zero-shot)	See all
COCO Visual Question Answering (VQA) real images 1.0 multiple choice	MCB 7 att.	See all
VQA-CP	CSS	See all
VQA-CE	RandImg	See all
VCR (QA-R) test	GPT4RoI	See all
VCR (Q-AR) test	GPT4RoI	See all
GQA test-std	NSM	See all
VQA v1 test-dev	SAAA (ResNet)	See all
IllusionVQA	GPT4-Vision	See all
VQA v1 test-std	RAU (ResNet)	See all
GQA Test2019	TRRNet (Ensemble)	See all
WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images	BLIP2 FlanT5-XXL (Fine-tuned)	See all
InfoSeek	RA-VQAv2 w/ PreFLMR	See all
CLEVR-Humans	MDETR	See all
QLEVR	MAC	See all
COCO Visual Question Answering (VQA) abstract images 1.0 open ended	Graph VQA	See all
COCO Visual Question Answering (VQA) abstract 1.0 multiple choice	Graph VQA	See all
Visual7W	CMN	See all
COCO Visual Question Answering (VQA) real images 2.0 open ended	HDU-USYD-UNCC	See all
PMC-VQA	MedVInT	See all
AI2D	SMoLA-PaLI-X Specialist Model	See all
VCR (Q-A) dev	VL-BERTLARGE	See all
VCR (QA-R) dev	VL-BERTLARGE	See all
VCR (Q-AR) dev	VL-BERTLARGE	See all
VizWiz 2018	Colin	See all
VizWiz 2020 VQA	PaLI	See all
PlotQA-D1	MatCha	See all
FigureQA - test 1	PReFIL	See all
PlotQA-D2	MatCha	See all
F-VQA	ZS-F-VQA	See all
HallusionBench	GPT-4V	See all
TDIUC	Accuracy	See all
VizWiz 2020 Answerability	CLIP-Ensemble	See all
TextVQA test-standard	TAP	See all
GQA	RelViT	See all
GRIT	Unified-IOXL	See all
TGIF-QA	HiTeA	See all
VQA-X	OFA-X-MT	See all
Visual Genome (pairs)	CMN	See all
Visual Genome (subjects)	CMN	See all
DocVQA val	BERT LARGE Baseline	See all
ZS-F-VQA	SAN † - hard mask	See all
WebSRC	DUBLIN	See all
DeepForm	DUBLIN	See all
SciGraphQA	SciGraphQA-baseline	See all
DVQA test-familiar	PReFIL (Oracle OCR)	See all
CORE-MM	GPT-4V	See all
VizWiz 2018 Answerability	ensemble_two_best	See all

Show all 62 benchmarks

Collapse benchmarks

Libraries

Use these libraries to find Visual Question Answering (VQA) models and implementations

huggingface/transformers

13 papers

124,889

salesforce/lavis

7 papers

8,713

ZephyrZhuQi/ssbaseline

5 papers

gabegrand/adversarial-vqa

5 papers

See all 15 libraries.

Datasets

Subtasks

Embodied Question Answering

3D Question Answering (3D-QA)

Generative Visual Question Answering

Factual Visual Question Answering

Latest papers

Most implemented Social Latest No code

NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results

lixinustc/kvq-challenge-cvpr-ntire2024 • • 17 Apr 2024

This paper reviews the NTIRE 2024 Challenge on Shortform UGC Video Quality Assessment (S-UGC VQA), where various excellent solutions are submitted and evaluated on the collected dataset KVQ from popular short-form video platform, i. e., Kuaishou/Kwai Platform.

17 Apr 2024

Paper
Code

MoE-TinyMed: Mixture of Experts for Tiny Medical Large Vision-Language Models

jiangsongtao/tinymed • • 16 Apr 2024

Mixture of Expert Tuning (MoE-Tuning) has effectively enhanced the performance of general MLLMs with fewer parameters, yet its application in resource-limited medical settings has not been fully explored.

16 Apr 2024

Paper
Code

ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images

minhquan6203/vitextvqa-dataset • 16 Apr 2024

Visual Question Answering (VQA) is a complicated task that requires the capability of simultaneously processing natural language and images.

16 Apr 2024

Paper
Code

TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding

bzluan/textcot • 15 Apr 2024

The image overview stage provides a comprehensive understanding of the global scene information, and the coarse localization stage approximates the image area containing the answer based on the question asked.

15 Apr 2024

Paper
Code

Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts

faceonlive/ai-research • 12 Apr 2024

This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline.

140

12 Apr 2024

Paper
Code

OmniFusion Technical Report

airi-institute/omnifusion • • 9 Apr 2024

We propose an \textit{OmniFusion} model based on a pretrained LLM and adapters for visual modality.

180

09 Apr 2024

Paper
Code

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

boheumd/MA-LMM • • 8 Apr 2024

However, existing LLM-based large multimodal models (e. g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding.

109

08 Apr 2024

Paper
Code

Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models

faceonlive/ai-research • 6 Apr 2024

In this paper, we present a novel approach, Joint Visual and Text Prompting (VTPrompt), that employs fine-grained visual information to enhance the capability of MLLMs in VQA, especially for object-oriented perception.

140

06 Apr 2024

Paper
Code

Evaluating Text-to-Visual Generation with Image-to-Text Generation

linzhiqiu/t2i_metrics • • 1 Apr 2024

For instance, the widely-used CLIPScore measures the alignment between a (generated) image and text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations.

01 Apr 2024

Paper
Code

Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models

atsumiyai/upd • • 29 Mar 2024

This paper introduces a novel and significant challenge for Vision Language Models (VLMs), termed Unsolvable Problem Detection (UPD).

29 Mar 2024

Paper
Code

Visual Question Answering (VQA)

Benchmarks Add a Result

Libraries

Datasets

Subtasks

Latest papers

Content

Benchmarks

Add a Result