TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Visual Question Answering (VQA)	COCO Visual Question Answering (VQA) real images 1.0 multiple choice	MCB 7 att.	Percentage correct	70.1	# 1
Visual Question Answering (VQA)	COCO Visual Question Answering (VQA) real images 1.0 open ended	MCB 7 att.	Percentage correct	66.5	# 1
Phrase Grounding	Flickr30k Entities Test	MCB	Accuracy	48.69	# 1
Phrase Grounding	ReferIt	MCB	Accuracy	28.91	# 1
Visual Question Answering (VQA)	Visual7W	MCB+Att.	Percentage correct	62.2	# 4
Visual Question Answering (VQA)	VQA v1 test-dev	MCB (ResNet)	Accuracy	64.2	# 3
Visual Question Answering (VQA)	VQA v2 test-dev	MCB	Accuracy	64.7	# 45

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multimodal-compact-bilinear-pooling-for/visual-question-answering-on-coco-visual-1)](https://paperswithcode.com/sota/visual-question-answering-on-coco-visual-1?p=multimodal-compact-bilinear-pooling-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multimodal-compact-bilinear-pooling-for/visual-question-answering-on-coco-visual-4)](https://paperswithcode.com/sota/visual-question-answering-on-coco-visual-4?p=multimodal-compact-bilinear-pooling-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multimodal-compact-bilinear-pooling-for/phrase-grounding-on-flickr30k-entities-test)](https://paperswithcode.com/sota/phrase-grounding-on-flickr30k-entities-test?p=multimodal-compact-bilinear-pooling-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multimodal-compact-bilinear-pooling-for/phrase-grounding-on-referit)](https://paperswithcode.com/sota/phrase-grounding-on-referit?p=multimodal-compact-bilinear-pooling-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multimodal-compact-bilinear-pooling-for/visual-question-answering-on-vqa-v1-test-dev)](https://paperswithcode.com/sota/visual-question-answering-on-vqa-v1-test-dev?p=multimodal-compact-bilinear-pooling-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multimodal-compact-bilinear-pooling-for/visual-question-answering-on-visual7w)](https://paperswithcode.com/sota/visual-question-answering-on-visual7w?p=multimodal-compact-bilinear-pooling-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multimodal-compact-bilinear-pooling-for/visual-question-answering-on-vqa-v2-test-dev)](https://paperswithcode.com/sota/visual-question-answering-on-vqa-v2-test-dev?p=multimodal-compact-bilinear-pooling-for)`

Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

EMNLP 2016 · Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, Marcus Rohrbach

Modeling textual or visual information with vector representations trained from large language or visual datasets has been successfully explored in recent years. However, tasks such as visual question answering require combining these vector representations with each other. Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual representations. We hypothesize that these methods are not as expressive as an outer product of the visual and textual vectors. As the outer product is typically infeasible due to its high dimensionality, we instead propose utilizing Multimodal Compact Bilinear pooling (MCB) to efficiently and expressively combine multimodal features. We extensively evaluate MCB on the visual question answering and grounding tasks. We consistently show the benefit of MCB over ablations without MCB. For visual question answering, we present an architecture which uses MCB twice, once for predicting attention over spatial features and again to combine the attended representation with the question representation. This model outperforms the state-of-the-art on the Visual7W dataset and the VQA challenge.
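The abstract does not say how the outer product is compressed, so the following is only a minimal NumPy sketch under the assumption that MCB uses the standard compact bilinear pooling recipe: each vector is projected with a random Count Sketch, and the two sketches are combined by circular convolution computed with an FFT. The function names (`make_sketch_params`, `count_sketch`, `mcb_pool`, `sketch_of_outer_product`) and the small dimensions are hypothetical choices for illustration, not the authors' implementation. The last function builds the sketch of the full outer product explicitly, so the two results can be compared.

```python
import numpy as np

def make_sketch_params(n, d, rng):
    """Draw a random hash h: [n] -> [d] and random signs s: [n] -> {-1, +1}."""
    h = rng.integers(0, d, size=n)
    s = rng.choice(np.array([-1.0, 1.0]), size=n)
    return h, s

def count_sketch(v, h, s, d):
    """Project v (length n) to length d by accumulating s[i] * v[i] at index h[i]."""
    out = np.zeros(d)
    np.add.at(out, h, s * v)  # unbuffered add, so hash collisions accumulate
    return out

def mcb_pool(x, q, params_x, params_q, d):
    """Combine two vectors by circular convolution of their count sketches (via FFT)."""
    sx = count_sketch(x, *params_x, d)
    sq = count_sketch(q, *params_q, d)
    return np.fft.irfft(np.fft.rfft(sx) * np.fft.rfft(sq), n=d)

def sketch_of_outer_product(x, q, params_x, params_q, d):
    """Reference: sketch the explicit outer product, which costs O(len(x) * len(q))."""
    (hx, sx), (hq, sq) = params_x, params_q
    h = (hx[:, None] + hq[None, :]) % d   # combined hash for entry (i, j)
    s = sx[:, None] * sq[None, :]         # combined sign for entry (i, j)
    out = np.zeros(d)
    np.add.at(out, h.ravel(), (s * np.outer(x, q)).ravel())
    return out

# Illustrative sizes only (assumed, not taken from the paper).
rng = np.random.default_rng(0)
n_x, n_q, d = 64, 48, 256
x = rng.standard_normal(n_x)   # stand-in for a visual feature vector
q = rng.standard_normal(n_q)   # stand-in for a question feature vector
params_x = make_sketch_params(n_x, d, rng)
params_q = make_sketch_params(n_q, d, rng)

pooled = mcb_pool(x, q, params_x, params_q, d)
reference = sketch_of_outer_product(x, q, params_x, params_q, d)
print(np.allclose(pooled, reference))
```

The FFT route never materializes the `n_x * n_q` outer product, which is the efficiency argument the abstract makes. In a trained model the hash and sign parameters would be drawn once and kept fixed.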

Code

akirafukui/vqa-mcb (official)	218
Cadene/vqa.pytorch	699
MarcBS/keras	225
yikang-li/iqan
jnhwkim/cbp

See all 10 implementations

Tasks

Visual Grounding

Visual Question Answering

Visual Question Answering (VQA)

Datasets

MS COCO

Visual Question Answering

Visual Genome

Flickr30k

Visual Question Answering v2.0

Flickr30K Entities

Visual7W

Results from the Paper

Ranked #1 on Visual Question Answering (VQA) on COCO Visual Question Answering (VQA) real images 1.0 multiple choice

Methods

1x1 Convolution • Average Pooling • Batch Normalization • Bottleneck Residual Block • Convolution • Global Average Pooling • Kaiming Initialization • Max Pooling • ReLU • Residual Block • Residual Connection • ResNet
