Visual Question Answering (VQA)

758 papers with code • 62 benchmarks • 112 datasets

Visual Question Answering (VQA) is a computer-vision task in which a system answers questions about an image. The goal of VQA is to teach machines to understand the content of an image and answer questions about it in natural language.

Image Source: visualqa.org
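As a concrete illustration of the task, the sketch below runs an off-the-shelf VQA model through the Hugging Face transformers library. The dandelin/vilt-b32-finetuned-vqa checkpoint, the COCO image URL, and the question are illustrative assumptions, not part of this page.

```python
# Minimal VQA inference sketch (assumes the transformers, Pillow, and requests packages).
import requests
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

# Any image/question pair works; this COCO image is just an example.
image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
question = "How many cats are in the picture?"

inputs = processor(image, question, return_tensors="pt")  # joint image+text encoding
logits = model(**inputs).logits                           # scores over the answer vocabulary
print(model.config.id2label[logits.argmax(-1).item()])    # predicted answer string
```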

Most implemented papers

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

facebookresearch/vilbert-multi-task NeurIPS 2019

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.
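ViLBERT's two-stream design exchanges information between the image and language streams through co-attentional transformer layers. The sketch below is a simplified, hedged rendition of that idea built from stock PyTorch modules; dimensions and module names are illustrative and omit the per-stream self-attention and feed-forward blocks of the released model.

```python
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    """Simplified two-stream co-attention: each modality queries the other (illustrative)."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.vis_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attends_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, vis, txt):
        # Visual stream: queries are image-region features, keys/values are word features.
        v, _ = self.vis_attends_txt(vis, txt, txt)
        # Linguistic stream: queries are word features, keys/values are image regions.
        t, _ = self.txt_attends_vis(txt, vis, vis)
        return self.norm_v(vis + v), self.norm_t(txt + t)

regions = torch.randn(2, 36, 768)   # e.g. 36 detected regions per image
words = torch.randn(2, 20, 768)     # e.g. 20 token embeddings per question
regions, words = CoAttentionLayer()(regions, words)
```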

Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

akirafukui/vqa-mcb EMNLP 2016

Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual representations.
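The baseline fusion operators listed above are easy to write down explicitly; the sketch below shows them side by side (feature dimensions are arbitrary placeholders) to contrast with the compact bilinear pooling the paper proposes.

```python
import torch

v = torch.randn(8, 2048)   # visual feature (placeholder dimension)
q = torch.randn(8, 2048)   # question feature (placeholder dimension)

fused_product = v * q                       # element-wise product
fused_sum = v + q                           # element-wise sum
fused_concat = torch.cat([v, q], dim=-1)    # concatenation

# A full bilinear interaction would be the outer product of v and q
# (~4M dimensions per pair at these sizes); compact bilinear pooling
# approximates it with a count-sketch projection computed in the frequency domain.
```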

Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge

peteanderson80/bottom-up-attention CVPR 2018

This paper presents a state-of-the-art model for visual question answering (VQA), which won first place in the 2017 VQA Challenge.

Compositional Attention Networks for Machine Reasoning

stanfordnlp/mac-network ICLR 2018

We present the MAC network, a novel fully differentiable neural network architecture, designed to facilitate explicit and expressive reasoning.
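The recurrent reasoning structure rests on a cell with control, read, and write units. The sketch below is a heavily simplified, hedged rendition of one such cell to convey that structure; it drops the gating and many projections of the full model, and all names and shapes are illustrative.

```python
import torch
import torch.nn as nn

class SimpleMACCell(nn.Module):
    """Toy control/read/write cell in the spirit of the MAC network (illustrative only)."""
    def __init__(self, dim: int):
        super().__init__()
        self.ctrl_proj = nn.Linear(2 * dim, dim)
        self.ctrl_attn = nn.Linear(dim, 1)
        self.read_proj = nn.Linear(2 * dim, dim)
        self.read_attn = nn.Linear(dim, 1)
        self.write = nn.Linear(2 * dim, dim)

    def forward(self, control, memory, question, words, knowledge):
        # Control: attend over question words, conditioned on the previous control state.
        c_q = self.ctrl_proj(torch.cat([control, question], dim=-1))      # (B, D)
        c_scores = self.ctrl_attn(c_q.unsqueeze(1) * words)               # (B, L, 1)
        control = (c_scores.softmax(dim=1) * words).sum(dim=1)            # (B, D)

        # Read: attend over image (knowledge-base) features, conditioned on memory and control.
        inter = self.read_proj(torch.cat(
            [memory.unsqueeze(1).expand_as(knowledge), knowledge], dim=-1))  # (B, N, D)
        r_scores = self.read_attn(control.unsqueeze(1) * inter)           # (B, N, 1)
        retrieved = (r_scores.softmax(dim=1) * knowledge).sum(dim=1)      # (B, D)

        # Write: integrate the retrieved information into the memory state.
        memory = self.write(torch.cat([memory, retrieved], dim=-1))       # (B, D)
        return control, memory
```

Running several such cells in sequence, each producing a new control and memory state, is what gives the model its step-by-step reasoning flavour.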

Hierarchical Question-Image Co-Attention for Visual Question Answering

jiasenlu/HieCoAttenVQA NeurIPS 2016

In addition, our model reasons about the question (and, through the co-attention mechanism, the image) in a hierarchical fashion via novel 1-dimensional convolutional neural networks (CNNs).
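The phrase level of that hierarchy can be illustrated with ordinary 1-D convolutions over the word embeddings; the sketch below (filter sizes and dimensions chosen arbitrarily) applies unigram/bigram/trigram filters and takes a max over window sizes at each word position, roughly in the spirit of the paper.

```python
import torch
import torch.nn as nn

class PhraseFeatures(nn.Module):
    """Unigram/bigram/trigram 1-D convolutions over word embeddings (illustrative)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.unigram = nn.Conv1d(dim, dim, kernel_size=1)
        self.bigram = nn.Conv1d(dim, dim, kernel_size=2, padding=1)
        self.trigram = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, words):                      # words: (B, L, D)
        x = words.transpose(1, 2)                  # Conv1d expects (B, D, L)
        uni = self.unigram(x)
        bi = self.bigram(x)[:, :, :x.size(2)]      # trim the extra position from padding
        tri = self.trigram(x)
        # Max over the n-gram window sizes at each word position.
        phrase = torch.stack([uni, bi, tri], dim=-1).max(dim=-1).values
        return phrase.transpose(1, 2)              # back to (B, L, D)

print(PhraseFeatures()(torch.randn(2, 10, 512)).shape)   # torch.Size([2, 10, 512])
```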

Pythia v0.1: the Winning Entry to the VQA Challenge 2018

facebookresearch/pythia 26 Jul 2018

We demonstrate that by making subtle but important changes to the model architecture and the learning rate schedule, fine-tuning image features, and adding data augmentation, we can significantly improve the performance of the up-down model on the VQA v2.0 dataset -- from 65.67% to 70.22%.
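One of the changes mentioned above is the learning-rate schedule. The sketch below shows a generic warmup-then-step-decay schedule of that general flavour using PyTorch's LambdaLR; all hyperparameter values are placeholders, not the ones Pythia actually used.

```python
import torch

model = torch.nn.Linear(10, 2)                       # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)

warmup_iters = 1000                                  # placeholder values
decay_points = {14000: 0.1, 18000: 0.01}             # iteration -> multiplicative factor

def lr_lambda(step: int) -> float:
    if step < warmup_iters:                          # linear warmup
        return (step + 1) / warmup_iters
    factor = 1.0
    for boundary, mult in decay_points.items():      # step decay after warmup
        if step >= boundary:
            factor = mult
    return factor

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(20000):
    optimizer.step()                                 # forward/backward pass would go here
    scheduler.step()
```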

LXMERT: Learning Cross-Modality Encoder Representations from Transformers

airsplay/lxmert IJCNLP 2019

In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.
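The three-encoder layout can be sketched with stock PyTorch transformer layers: a language encoder over tokens, an object-relationship encoder over region features, and a cross-modality block on top. This is a structural caricature with made-up depths and sizes, not the released LXMERT code.

```python
import torch
import torch.nn as nn

def encoder(layers: int, dim: int = 768, heads: int = 12) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)

class ThreeEncoderSketch(nn.Module):
    """Language, object-relationship, and cross-modality encoders (illustrative)."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.language_encoder = encoder(layers=2, dim=dim)   # over word embeddings
        self.object_encoder = encoder(layers=2, dim=dim)     # over detected-region features
        self.cross_lang = nn.MultiheadAttention(dim, 12, batch_first=True)
        self.cross_vis = nn.MultiheadAttention(dim, 12, batch_first=True)

    def forward(self, words, regions):
        lang = self.language_encoder(words)     # single-modality language encoder
        vis = self.object_encoder(regions)      # single-modality object encoder
        # Cross-modality encoder: each stream attends to the other.
        lang_x, _ = self.cross_lang(lang, vis, vis)
        vis_x, _ = self.cross_vis(vis, lang, lang)
        return lang + lang_x, vis + vis_x

lang, vis = ThreeEncoderSketch()(torch.randn(2, 20, 768), torch.randn(2, 36, 768))
```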

GPT-4 Technical Report

openai/evals Preprint 2023

We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
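At the usage level, a VQA-style query can be posed by sending an image and a question to a multimodal chat model through the OpenAI Python client, as in the sketch below; the model name and image URL are assumptions, and the interface shown is the public chat-completions API rather than anything described in the report itself.

```python
# Hedged sketch: asking a VQA-style question of a multimodal chat model.
# Assumes the openai Python client and a vision-capable model name ("gpt-4o" here);
# substitute whichever model and image URL you actually have access to.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the person in this photo holding?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```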

Hadamard Product for Low-rank Bilinear Pooling

jnhwkim/MulLowBiVQA 14 Oct 2016

Bilinear models provide rich representations compared with linear models.
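The low-rank trick behind the title can be sketched directly: project both modalities into a shared space, take the Hadamard (element-wise) product, and project again, approximating a full bilinear map without its quadratic parameter count. All dimensions below are placeholders.

```python
import torch
import torch.nn as nn

class LowRankBilinearPooling(nn.Module):
    """z = P(tanh(U x) * tanh(V y)): Hadamard-product low-rank bilinear pooling (sketch)."""
    def __init__(self, x_dim: int, y_dim: int, rank: int, out_dim: int):
        super().__init__()
        self.U = nn.Linear(x_dim, rank)
        self.V = nn.Linear(y_dim, rank)
        self.P = nn.Linear(rank, out_dim)

    def forward(self, x, y):
        return self.P(torch.tanh(self.U(x)) * torch.tanh(self.V(y)))

fused = LowRankBilinearPooling(2048, 1024, rank=1200, out_dim=1000)(
    torch.randn(8, 2048), torch.randn(8, 1024))
print(fused.shape)   # torch.Size([8, 1000])
```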

Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments

peteanderson80/Matterport3DSimulator CVPR 2018

This is significant because a robot interpreting a natural-language navigation instruction on the basis of what it sees is carrying out a vision-and-language process similar to Visual Question Answering.