Airbert: In-domain Pretraining for Vision-and-Language Navigation

Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions. Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging. Recent methods explore pretraining to improve generalization; however, the use of generic image-caption datasets or existing small-scale VLN environments is suboptimal and results in limited improvements. In this work, we introduce BnB, a large-scale and diverse in-domain VLN dataset. We first collect image-caption (IC) pairs from hundreds of thousands of listings on online rental marketplaces. Using these IC pairs, we next propose automatic strategies to generate millions of VLN path-instruction (PI) pairs. We further propose a shuffling loss that improves the learning of temporal order inside PI pairs. We use BnB to pretrain our Airbert model, which can be adapted to discriminative and generative settings, and show that it outperforms the state of the art on the Room-to-Room (R2R) navigation and Remote Referring Expression (REVERIE) benchmarks. Moreover, our in-domain pretraining significantly increases performance on a challenging few-shot VLN evaluation, where we train the model only on VLN instructions from a few houses.
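The abstract only outlines how BnB image-caption pairs are turned into path-instruction pairs and how the shuffling loss encourages learning of temporal order. The Python sketch below illustrates one plausible reading of those two steps; the function names, the number of rooms sampled per listing, and the caption-joining scheme are illustrative assumptions, not the authors' released pipeline.

```python
import random

def make_path_instruction_pair(listing_ic_pairs, num_rooms=5):
    """Illustrative sketch: build a pseudo path-instruction pair by
    concatenating image-caption pairs sampled from one BnB listing.

    listing_ic_pairs: list of (image, caption) tuples from a single listing.
    """
    sampled = random.sample(listing_ic_pairs, min(num_rooms, len(listing_ic_pairs)))
    path = [image for image, _ in sampled]                        # sequence of room images
    instruction = ". ".join(caption for _, caption in sampled)    # captions joined into one instruction
    return path, instruction

def shuffled_negative(path):
    """Negative sample for a shuffling-style loss: permute the visual order
    so the model must score the temporally consistent pairing higher."""
    permuted = path[:]
    random.shuffle(permuted)
    return permuted

# Usage (assumed data layout): one positive PI pair plus a shuffled negative
# listing = [(img1, "A bright kitchen"), (img2, "Cozy bedroom with a desk"), ...]
# path, instruction = make_path_instruction_pair(listing)
# neg_path = shuffled_negative(path)
```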

ICCV 2021

Datasets


Introduced in the Paper:

BnB

Used in the Paper:

R2R
Task: Vision and Language Navigation (VLN Challenge) · Model: Airbert

Metric            Value     Global Rank
success           0.78      #3
length            686.54    #9
error             2.58      #143
oracle success    0.99      #2
spl               0.01      #134

Methods


No methods listed for this paper.