Learning to Reason: End-to-End Module Networks for Visual Question Answering

Natural language questions are inherently compositional, and many are most easily answered by reasoning about their decomposition into modular sub-problems. For example, to answer "is there an equal number of balls and boxes?" we can look for balls, look for boxes, count them, and compare the results. The recently proposed Neural Module Network (NMN) architecture implements this approach to question answering by parsing questions into linguistic substructures and assembling question-specific deep networks from smaller modules that each solve one subtask. However, existing NMN implementations rely on brittle off-the-shelf parsers, and are restricted to the module configurations proposed by these parsers rather than learning them from data. In this paper, we propose End-to-End Module Networks (N2NMNs), which learn to reason by directly predicting instance-specific network layouts without the aid of a parser. Our model learns to generate network structures (by imitating expert demonstrations) while simultaneously learning network parameters (using the downstream task loss). Experimental results on the new CLEVR dataset targeted at compositional question answering show that N2NMNs achieve an error reduction of nearly 50% relative to state-of-the-art attentional approaches, while discovering interpretable network architectures specialized for each question.
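
To make the modular composition described in the abstract concrete, below is a minimal PyTorch-style sketch (not the authors' released implementation) of how an instance-specific layout such as compare(count(find[balls]), count(find[boxes])) might be assembled from reusable modules for the example question. The module definitions, tensor shapes, and the `assemble_equal_count` helper are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of assembling a question-specific network from neural modules.
# All module internals below are simplified, assumed designs for illustration.
import torch
import torch.nn as nn

class Find(nn.Module):
    """Produces an attention map over image regions matching a textual argument."""
    def __init__(self, img_dim, txt_dim):
        super().__init__()
        self.proj = nn.Conv2d(img_dim, txt_dim, kernel_size=1)
        self.score = nn.Conv2d(txt_dim, 1, kernel_size=1)

    def forward(self, img_feat, txt_emb):
        # img_feat: (B, C, H, W); txt_emb: (B, D) embedding of e.g. "balls"
        joint = torch.relu(self.proj(img_feat) * txt_emb[:, :, None, None])
        return self.score(joint)                            # (B, 1, H, W) attention

class Count(nn.Module):
    """Summarizes an attention map into a count feature."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.fc = nn.Linear(1, hidden_dim)

    def forward(self, attn):
        total = attn.flatten(1).sum(dim=1, keepdim=True)    # (B, 1) soft count
        return torch.relu(self.fc(total))                   # (B, hidden_dim)

class Compare(nn.Module):
    """Compares two count features and scores the candidate answers."""
    def __init__(self, hidden_dim, num_answers):
        super().__init__()
        self.fc = nn.Linear(2 * hidden_dim, num_answers)

    def forward(self, a, b):
        return self.fc(torch.cat([a, b], dim=1))            # (B, num_answers) logits

def assemble_equal_count(modules, img_feat, ball_emb, box_emb):
    """Executes the layout compare(count(find[ball]), count(find[box]))."""
    find, count, compare = modules["find"], modules["count"], modules["compare"]
    n_balls = count(find(img_feat, ball_emb))
    n_boxes = count(find(img_feat, box_emb))
    return compare(n_balls, n_boxes)                        # answer logits
```

In the full model, the layout executed by a function like the one above is not hard-coded: it is emitted by a learned layout policy (trained first by imitating expert demonstrations, then refined with the downstream task loss), and the module parameters are shared across all questions.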

ICCV 2017 | PDF | Abstract
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|------|---------|-------|-------------|--------------|-------------|
| Visual Dialog | Visual Dialog v1.0 test-std | NMN | NDCG (x 100) | 58.1 | #52 |
| Visual Dialog | Visual Dialog v1.0 test-std | NMN | MRR (x 100) | 58.8 | #46 |
| Visual Dialog | Visual Dialog v1.0 test-std | NMN | R@1 | 44.15 | #57 |
| Visual Dialog | Visual Dialog v1.0 test-std | NMN | R@5 | 76.88 | #44 |
| Visual Dialog | Visual Dialog v1.0 test-std | NMN | R@10 | 86.88 | #43 |
| Visual Dialog | Visual Dialog v1.0 test-std | NMN | Mean | 4.4 | #42 |
| Visual Question Answering (VQA) | VQA v2 test-dev | N2NMN (ResNet-152, policy search) | Accuracy | 64.9 | #43 |
