We address the problem of Visual Question Answering (VQA), which requires joint image and language understanding to answer a question about a given photograph. Recent approaches have applied deep image captioning methods based on convolutional-recurrent networks to this problem, but have failed to model spatial inference... (read more)
PDFMETHOD | TYPE | |
---|---|---|
![]() |
Working Memory Models |