Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering

17 Nov 2015 · Huijuan Xu, Kate Saenko

We address the problem of Visual Question Answering (VQA), which requires joint image and language understanding to answer a question about a given photograph. Recent approaches have applied deep image captioning methods based on convolutional-recurrent networks to this problem, but have failed to model spatial inference...
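
To make the question-guided spatial attention idea concrete, here is a minimal sketch (not the authors' code): question features score each cell of a CNN feature grid, a softmax over cells yields a spatial attention map, and the attention-weighted visual evidence is fused with the question embedding to predict an answer. The class name, layer sizes, and bag-of-words question encoding are illustrative assumptions, and details of the full SMem-VQA model are omitted.

```python
# Sketch of question-guided spatial attention for VQA (assumes PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionVQA(nn.Module):
    def __init__(self, vocab_size, num_answers, d_img=512, d_common=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_common)   # word embeddings
        self.att_proj = nn.Linear(d_img, d_common)            # "attention" embedding of grid cells
        self.img_proj = nn.Linear(d_img, d_common)            # "evidence" embedding of grid cells
        self.classifier = nn.Linear(d_common, num_answers)

    def forward(self, img_feats, question):
        # img_feats: (B, N, d_img) CNN grid features, e.g. N = 14 * 14 cells
        # question:  (B, T) word indices
        q = self.word_emb(question).sum(dim=1)                 # (B, d_common) question embedding
        keys = self.att_proj(img_feats)                        # (B, N, d_common)
        scores = torch.einsum('bnd,bd->bn', keys, q)           # question-cell correlation
        att = F.softmax(scores, dim=1)                         # (B, N) spatial attention map
        vals = self.img_proj(img_feats)                        # (B, N, d_common)
        attended = torch.einsum('bn,bnd->bd', att, vals)       # attention-weighted visual evidence
        return self.classifier(attended + q)                   # fuse with question, predict answer
```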

Benchmark result

Task: Visual Question Answering
Dataset: COCO Visual Question Answering (VQA) real images 1.0, open-ended
Model: SMem-VQA
Metric: Percentage correct
Metric value: 58.2
Global rank: #11

Methods used in the Paper


Method: Memory Network
Type: Working Memory Models