Modeling Relationships in Referential Expressions with Compositional Modular Networks

People often refer to entities in an image in terms of their relationships with other entities. For example, "the black cat sitting under the table" refers to both a "black cat" entity and its relationship with another "table" entity. Understanding these relationships is essential for interpreting and grounding such natural language expressions. Most prior work focuses on either grounding entire referential expressions holistically to one region, or localizing relationships based on a fixed set of categories. In this paper we instead present a modular deep architecture capable of analyzing referential expressions into their component parts, identifying entities and relationships mentioned in the input expression and grounding them all in the scene. We call this approach Compositional Modular Networks (CMNs): a novel architecture that learns linguistic analysis and visual inference end-to-end. Our approach is built around two types of neural modules that inspect local regions and pairwise interactions between regions. We evaluate CMNs on multiple referential expression datasets, outperforming state-of-the-art approaches on all tasks.
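The abstract describes the architecture only at a high level: one module type scores individual regions against an entity phrase, and another scores region pairs against a relationship phrase. The sketch below illustrates how such unary and pairwise scoring modules might be combined. It is a minimal PyTorch sketch, not the authors' released implementation: the class and function names, hidden dimensions, and the elementwise multiplicative fusion of region and query features are all illustrative assumptions.

```python
# Minimal sketch of CMN-style unary and pairwise scoring modules.
# All names, dimensions, and the fusion form are illustrative assumptions,
# not the paper's exact implementation.
import torch
import torch.nn as nn

class LocalizationModule(nn.Module):
    """Scores how well each single region matches a textual query embedding."""
    def __init__(self, vis_dim, txt_dim, hid_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid_dim)
        self.txt_proj = nn.Linear(txt_dim, hid_dim)
        self.score = nn.Linear(hid_dim, 1)

    def forward(self, region_feats, query):
        # region_feats: (num_regions, vis_dim), query: (txt_dim,)
        joint = self.vis_proj(region_feats) * self.txt_proj(query)  # elementwise fusion
        return self.score(torch.relu(joint)).squeeze(-1)            # (num_regions,)

class RelationshipModule(nn.Module):
    """Scores pairwise region interactions against a relationship query embedding."""
    def __init__(self, vis_dim, txt_dim, hid_dim=256):
        super().__init__()
        self.pair_proj = nn.Linear(2 * vis_dim, hid_dim)
        self.txt_proj = nn.Linear(txt_dim, hid_dim)
        self.score = nn.Linear(hid_dim, 1)

    def forward(self, region_feats, query):
        n = region_feats.size(0)
        subj = region_feats.unsqueeze(1).expand(n, n, -1)   # subject candidate i
        obj = region_feats.unsqueeze(0).expand(n, n, -1)    # object candidate j
        pairs = torch.cat([subj, obj], dim=-1)              # (n, n, 2 * vis_dim)
        joint = self.pair_proj(pairs) * self.txt_proj(query)
        return self.score(torch.relu(joint)).squeeze(-1)    # (n, n) pairwise scores

def pairwise_grounding_scores(regions, q_subj, q_rel, q_obj, loc_mod, rel_mod):
    """Total score for (subject i, object j): unary terms plus the pairwise term."""
    s_subj = loc_mod(regions, q_subj)   # (n,)
    s_obj = loc_mod(regions, q_obj)     # (n,)
    s_rel = rel_mod(regions, q_rel)     # (n, n)
    return s_subj.unsqueeze(1) + s_obj.unsqueeze(0) + s_rel  # (n, n)
```

At inference time, the argmax over the resulting (n, n) score matrix jointly selects a subject region and an object region for the expression.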

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Visual Question Answering (VQA) | Visual7W | CMN | Percentage correct | 72.53 | #1 |
| Visual Question Answering (VQA) | Visual Genome (pairs) | CMN | Percentage correct | 28.52 | #1 |
| Visual Question Answering (VQA) | Visual Genome (subjects) | CMN | Percentage correct | 44.24 | #1 |
