Visual Question Answering (VQA)

764 papers with code • 62 benchmarks • 112 datasets

Visual Question Answering (VQA) is a vision-and-language task in which a model answers natural-language questions about an image. Doing so requires understanding both the visual content and the question, and grounding the answer in what the image actually shows.
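
A minimal inference sketch using the Hugging Face transformers library is shown below. It assumes the publicly released ViLT checkpoint dandelin/vilt-b32-finetuned-vqa and a sample COCO image URL; any VQA-finetuned vision-language model would slot in the same way.

```python
# Minimal VQA inference sketch (assumes the transformers, Pillow and requests
# packages and the public ViLT checkpoint "dandelin/vilt-b32-finetuned-vqa").
import requests
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # sample COCO image
image = Image.open(requests.get(url, stream=True).raw)
question = "How many cats are on the bed?"

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits                     # scores over the answer vocabulary
answer = model.config.id2label[logits.argmax(-1).item()]
print(answer)
```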

Most implemented papers

Bilinear Attention Networks

jnhwkim/ban-vqa NeurIPS 2018

In this paper, we propose bilinear attention networks (BAN) that find bilinear attention distributions to utilize given vision-language information seamlessly.
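
A simplified sketch of the core idea follows: a bilinear attention map scored over every (image region, question word) pair, with the attended joint feature as output. The multi-glimpse and low-rank pooling details of the paper are omitted.

```python
# Sketch of a single bilinear attention map in the spirit of BAN
# (hedged: dimensions and nonlinearities are simplified relative to the paper).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearAttention(nn.Module):
    def __init__(self, v_dim, q_dim, h_dim):
        super().__init__()
        self.U = nn.Linear(v_dim, h_dim)   # projects image region features
        self.V = nn.Linear(q_dim, h_dim)   # projects question word features
        self.p = nn.Linear(h_dim, 1)       # scores each (region, word) pair

    def forward(self, v, q):
        # v: (B, num_regions, v_dim), q: (B, num_words, q_dim)
        v_proj = torch.relu(self.U(v))                        # (B, R, H)
        q_proj = torch.relu(self.V(q))                        # (B, W, H)
        # Bilinear interaction: elementwise product of every region/word pair.
        joint = v_proj.unsqueeze(2) * q_proj.unsqueeze(1)     # (B, R, W, H)
        att = F.softmax(self.p(joint).squeeze(-1).flatten(1), dim=-1)
        att = att.view(v.size(0), v.size(1), q.size(1))       # (B, R, W)
        # Attended joint feature: weighted sum over all region/word pairs.
        fused = (att.unsqueeze(-1) * joint).sum(dim=(1, 2))   # (B, H)
        return fused, att
```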

Simple Baseline for Visual Question Answering

metalbubble/VQAbaseline 7 Dec 2015

We describe a very simple bag-of-words baseline for visual question answering.
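
A sketch of the baseline's structure, assuming precomputed CNN image features; the feature dimension, embedding size, and vocabulary sizes below are placeholders.

```python
# Bag-of-words + image-feature baseline sketch (iBOWIMG-style):
# sum word embeddings, concatenate with a global image feature,
# and classify over candidate answers with a single linear layer.
import torch
import torch.nn as nn

class BowImgBaseline(nn.Module):
    def __init__(self, vocab_size, num_answers, img_dim=1024, emb_dim=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.classifier = nn.Linear(emb_dim + img_dim, num_answers)

    def forward(self, question_tokens, img_feat):
        # question_tokens: (B, seq_len) word indices, img_feat: (B, img_dim)
        bow = self.word_emb(question_tokens).sum(dim=1)   # bag of words: order ignored
        joint = torch.cat([bow, img_feat], dim=-1)        # concatenate the two modalities
        return self.classifier(joint)                     # logits over candidate answers
```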

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

necla-ml/SNLI-VE CVPR 2017

We propose to counter the language priors that let models answer without truly looking at the image, and make vision (the V in VQA) matter!

Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning

sea-snell/implicit-language-q-learning ICCV 2017

Specifically, we pose a cooperative 'image guessing' game between two agents -- Qbot and Abot -- who communicate in natural language dialog so that Qbot can select an unseen image from a lineup of images.
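
A heavily simplified sketch of the round structure of that game follows. Both agents here are stand-in stubs rather than the paper's learned recurrent policies, and the final reward only indicates where a REINFORCE-style update would apply.

```python
# Toy sketch of the Qbot/Abot image-guessing loop (hedged: agents are stubs;
# the paper trains sequence models with REINFORCE on a retrieval-based reward).
import random

def qbot_ask(dialog_history):
    # stub questioner; in the paper this is a learned dialog policy
    return "what color is the main object?"

def abot_answer(question, target_image):
    # stub answerer conditioned on the hidden target image
    return target_image["color"]

def qbot_guess(dialog_history, image_pool):
    # stub guesser: pick the pool image most consistent with the answers so far
    answers = [turn[1] for turn in dialog_history]
    scored = [(sum(a == img["color"] for a in answers), img) for img in image_pool]
    return max(scored, key=lambda s: s[0])[1]

image_pool = [{"id": i, "color": random.choice(["red", "blue", "green"])} for i in range(10)]
target = random.choice(image_pool)

dialog = []
for _ in range(3):                                   # fixed number of dialog rounds
    question = qbot_ask(dialog)
    answer = abot_answer(question, target)
    dialog.append((question, answer))

guess = qbot_guess(dialog, image_pool)
reward = 1.0 if guess["id"] == target["id"] else 0.0  # would drive the policy-gradient update
print(f"guessed image {guess['id']}, target {target['id']}, reward {reward}")
```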

Towards VQA Models That Can Read

facebookresearch/pythia CVPR 2019

We show that LoRRA outperforms existing state-of-the-art VQA models on our TextVQA dataset.

Deep Modular Co-Attention Networks for Visual Question Answering

MILVLG/mcan-vqa CVPR 2019

In this paper, we propose a deep Modular Co-Attention Network (MCAN) that consists of Modular Co-Attention (MCA) layers cascaded in depth.
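
A rough sketch of one MCA layer, stacked in depth, built from self-attention (SA) and guided-attention (GA) units; feed-forward sublayers, layer norms, and the paper's encoder-decoder cascading variant are omitted.

```python
# Sketch of Modular Co-Attention layers cascaded in depth (simplified).
import torch
import torch.nn as nn

class MCALayer(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.q_self_att = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v_self_att = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v_guided_att = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, q, v):
        # q: (B, num_words, dim) question features, v: (B, num_regions, dim) image features
        q, _ = self.q_self_att(q, q, q)       # SA unit: question attends to itself
        v, _ = self.v_self_att(v, v, v)       # SA unit: image regions attend to each other
        v, _ = self.v_guided_att(v, q, q)     # GA unit: image attends to the question
        return q, v

# cascade MCA layers in depth
layers = nn.ModuleList([MCALayer() for _ in range(6)])
q = torch.randn(2, 14, 512)   # dummy question features
v = torch.randn(2, 36, 512)   # dummy region features
for layer in layers:
    q, v = layer(q, v)
```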

VisualBERT: A Simple and Performant Baseline for Vision and Language

uclanlp/visualbert 9 Aug 2019

We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks.
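
A toy sketch of the core idea: project detected image-region features into the token embedding space and run a single transformer encoder over the joint text+vision sequence. The real model initializes from pretrained BERT and adds positional and pretraining details omitted here.

```python
# Simplified VisualBERT-style joint encoder (hedged: not the released model).
import torch
import torch.nn as nn

class TinyVisualBERT(nn.Module):
    def __init__(self, vocab_size=30522, dim=768, region_dim=2048, layers=4, heads=12):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        self.region_proj = nn.Linear(region_dim, dim)   # map region features into token space
        self.segment_emb = nn.Embedding(2, dim)         # 0 = text segment, 1 = vision segment
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)

    def forward(self, token_ids, region_feats):
        # token_ids: (B, T) wordpiece ids, region_feats: (B, R, region_dim)
        text = self.token_emb(token_ids) + self.segment_emb.weight[0]
        vision = self.region_proj(region_feats) + self.segment_emb.weight[1]
        joint = torch.cat([text, vision], dim=1)        # single joint sequence
        return self.encoder(joint)                      # contextualized text+vision features
```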

UNITER: UNiversal Image-TExt Representation Learning

ChenRocks/UNITER ECCV 2020

Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text).
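
A sketch of conditional masking under simplified assumptions (a single masking ratio, zeroed-out regions standing in for the paper's several masked-region-modeling objectives, and a BERT-style [MASK] id): only one modality is corrupted per sample while the other stays fully observed.

```python
# Conditional masking sketch (hedged: simplified relative to UNITER's pretraining).
import torch

MASK_ID = 103   # assumed BERT-style [MASK] token id

def conditional_mask(token_ids, region_feats, task, p=0.15):
    """Mask one modality while the other is fully observed."""
    token_ids = token_ids.clone()
    region_feats = region_feats.clone()
    if task == "mlm":
        # masked language modeling: corrupt text, keep all image regions intact
        mask = torch.rand_like(token_ids, dtype=torch.float) < p
        token_ids[mask] = MASK_ID
    elif task == "mrm":
        # masked region modeling: zero out regions, keep the full sentence intact
        mask = torch.rand(region_feats.shape[:2]) < p
        region_feats[mask] = 0.0
    else:
        raise ValueError("task must be 'mlm' or 'mrm'")
    return token_ids, region_feats, mask
```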

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

opengvlab/llama-adapter 28 Mar 2023

We present LLaMA-Adapter, a lightweight adaption method to efficiently fine-tune LLaMA into an instruction-following model.
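
A simplified sketch of the zero-initialized gating idea: a learnable gate, starting at zero, scales the contribution of adapter prompt tokens so training begins exactly at the frozen model's behavior. (In the paper the gate acts on the prompt attention scores inside the softmax; the version below gates the attended output instead, and reuses one attention module for both paths.)

```python
# Zero-init gated prompt attention sketch (hedged: not the actual LLaMA-Adapter code).
import torch
import torch.nn as nn

class ZeroInitPromptAttention(nn.Module):
    def __init__(self, dim=512, heads=8, prompt_len=10):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)  # learnable prompts
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # zero-init: adapter starts as a no-op

    def forward(self, x):
        # x: (B, T, dim) hidden states of a (conceptually frozen) transformer layer
        base, _ = self.attn(x, x, x)                       # ordinary self-attention path
        prompt = self.prompt.unsqueeze(0).expand(x.size(0), -1, -1)
        injected, _ = self.attn(x, prompt, prompt)         # attention over adapter prompts
        # gate is zero at initialization, so the output equals the base path at step 0
        return base + torch.tanh(self.gate) * injected
```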