A Focused Dynamic Attention Model for Visual Question Answering

6 Apr 2016 · Ilija Ilievski, Shuicheng Yan, Jiashi Feng

Visual Question Answering (VQA) problems are attracting increasing interest from multiple research disciplines. Solving a VQA problem requires techniques from computer vision for understanding the visual content of the presented image or video, as well as techniques from natural language processing for understanding the semantics of the question and generating the answer. Regarding visual content modeling, most existing VQA methods adopt the strategy of extracting global features from the image or video, which inevitably fails to capture fine-grained information such as the spatial configuration of multiple objects. Extracting features from auto-generated regions -- as some region-based image recognition methods do -- cannot fundamentally address this problem and may introduce an overwhelming number of features irrelevant to the question. In this work, we propose a novel Focused Dynamic Attention (FDA) model that provides image content representations better aligned with the posed questions. Being aware of the key words in the question, FDA employs an off-the-shelf object detector to identify the important regions and fuses the information from these regions and the global image features via an LSTM unit. Such question-driven representations are then combined with the question representation and fed into a reasoning unit to generate the answer. Extensive evaluation on a large-scale benchmark dataset, VQA, clearly demonstrates the superior performance of FDA over well-established baselines.
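To make the described pipeline concrete, below is a minimal PyTorch-style sketch of the FDA idea: a question encoder, a visual LSTM that fuses question-relevant region features with a global image feature, and a reasoning unit over the combined representation. All module names, feature dimensions, and fusion details here are illustrative assumptions and not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FocusedDynamicAttention(nn.Module):
    """Sketch of the FDA pipeline described in the abstract (assumed details)."""

    def __init__(self, vocab_size, embed_dim=300, feat_dim=2048,
                 hidden_dim=512, num_answers=1000):
        super().__init__()
        # Question encoder: word embeddings followed by an LSTM.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.q_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Visual fusion LSTM: consumes features of detector regions matched to
        # the question's key words, then the global image feature.
        self.v_proj = nn.Linear(feat_dim, hidden_dim)
        self.v_lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        # Reasoning unit: combines question and visual representations and
        # scores a fixed set of candidate answers.
        self.reasoning = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, question_tokens, region_feats, global_feat):
        # question_tokens: (B, T) word indices of the question
        # region_feats:    (B, R, feat_dim) features of question-relevant regions
        # global_feat:     (B, feat_dim) whole-image feature
        _, (q_hidden, _) = self.q_lstm(self.embed(question_tokens))
        q_repr = q_hidden[-1]                                      # (B, H)

        visual_seq = torch.cat(
            [region_feats, global_feat.unsqueeze(1)], dim=1)       # (B, R+1, feat_dim)
        _, (v_hidden, _) = self.v_lstm(self.v_proj(visual_seq))
        v_repr = v_hidden[-1]                                      # (B, H)

        # Question-driven visual representation combined with the question
        # representation, then passed to the reasoning unit.
        return self.reasoning(torch.cat([q_repr, v_repr], dim=1))
```

In this sketch, region features would come from an off-the-shelf object detector whose detections are matched to key words in the question; the ordering of regions before the global feature in the visual sequence follows the abstract's description of fusing region and global information in one LSTM pass.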

Task                             Dataset                                                                 Model  Metric              Value  Global Rank
Visual Question Answering (VQA)  COCO Visual Question Answering (VQA) real images 1.0, multiple choice   FDA    Percentage correct  64.2   #8
Visual Question Answering (VQA)  COCO Visual Question Answering (VQA) real images 1.0, open ended        FDA    Percentage correct  59.5   #9
