Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

We study joint learning of Convolutional Neural Network (CNN) and Transformer for vision-language pre-training (VLPT), which aims to learn cross-modal alignments from millions of image-text pairs. State-of-the-art approaches extract salient image regions and align regions with words step-by-step. As region-based visual features usually represent only parts of an image, it is challenging for existing vision-language models to fully understand the semantics of the paired natural languages. In this paper, we propose SOHO to "See Out of tHe bOx", which takes a whole image as input and learns vision-language representation in an end-to-end manner. SOHO does not require bounding box annotations, which enables inference 10 times faster than region-based approaches. In particular, SOHO learns to extract comprehensive yet compact image features through a visual dictionary (VD) that facilitates cross-modal understanding. VD is designed to represent consistent visual abstractions of similar semantics. It is updated on-the-fly and utilized in our proposed pre-training task Masked Visual Modeling (MVM). We conduct experiments on four well-established vision-language tasks by following standard VLPT settings. SOHO achieves absolute gains of 2.0% R@1 score on the MSCOCO text retrieval 5k test split, 1.5% accuracy on the NLVR$^2$ test-P split, and 6.7% accuracy on the SNLI-VE test split, respectively.
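The abstract only names the visual dictionary (VD) and its on-the-fly update, so the sketch below illustrates one common way such a codebook can be realized: nearest-neighbour lookup over CNN grid features, a momentum update of the selected entries, and a straight-through gradient back to the CNN. This is a minimal sketch assuming PyTorch; the class `VisualDictionary` and its parameters (`num_embeddings`, `dim`, `momentum`) are illustrative and not taken from the authors' code.

```python
# Minimal sketch of a visual-dictionary lookup with on-the-fly
# moving-average updates, in the spirit of the SOHO description above.
# Names and hyperparameters are illustrative, not the authors' code.
import torch
import torch.nn as nn


class VisualDictionary(nn.Module):
    def __init__(self, num_embeddings=2048, dim=768, momentum=0.99):
        super().__init__()
        self.momentum = momentum
        # Codebook of visual "words"; updated by moving average, not by gradients.
        self.register_buffer("codebook", torch.randn(num_embeddings, dim))

    @torch.no_grad()
    def _update(self, feats, indices):
        # Moving-average update of the codebook entries that were selected.
        for idx in indices.unique():
            selected = feats[indices == idx].mean(dim=0)
            self.codebook[idx] = (
                self.momentum * self.codebook[idx]
                + (1.0 - self.momentum) * selected
            )

    def forward(self, grid_feats):
        # grid_feats: (N, dim) CNN features, one per spatial location.
        dists = torch.cdist(grid_feats, self.codebook)   # (N, num_embeddings)
        indices = dists.argmin(dim=1)                    # visual-word ids
        quantized = self.codebook[indices]               # embedded features
        if self.training:
            self._update(grid_feats.detach(), indices)
        # Straight-through estimator so gradients still reach the CNN.
        quantized = grid_feats + (quantized - grid_feats).detach()
        return quantized, indices
```

Under this reading, the discrete `indices` returned by the lookup would also serve as the classification targets for the Masked Visual Modeling task, since each masked grid feature is mapped to a visual word to be predicted.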


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|------|---------|-------|-------------|--------------|-------------|
| Visual Reasoning | NLVR2 Dev | SOHO | Accuracy | 76.37 | #12 |
| Visual Reasoning | NLVR2 Test | SOHO | Accuracy | 77.32 | #11 |
| Visual Entailment | SNLI-VE test | SOHO | Accuracy | 84.95 | #5 |
| Visual Entailment | SNLI-VE val | SOHO | Accuracy | 85.00 | #5 |