Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation

CVPR 2019
Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, Lei Zhang

Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. In this paper, we study how to address three critical challenges for this task: the cross-modal grounding, the ill-posed feedback, and the generalization problems. We propose a Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via reinforcement learning: a matching critic provides an intrinsic reward that encourages global matching between instructions and trajectories, while a reasoning navigator performs cross-modal grounding in the local visual scene. To improve the generalizability of the learned policy, we further introduce a Self-Supervised Imitation Learning (SIL) method that explores unseen environments by imitating the agent's own past, good decisions.
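As a rough, hedged sketch of the two ideas named in the title (not the authors' released code — `navigator`, `critic`, `env`, `buffer`, and the weight `beta` below are all illustrative assumptions), RCM's intrinsic matching reward can be blended with the extrinsic navigation reward before a REINFORCE-style update, and SIL can be read as behavior-cloning the agent's own best past trajectory:

```python
# Hedged sketch; `navigator`, `critic`, `env`, `buffer`, and `beta` are
# illustrative placeholders, not the paper's actual API.
import torch


def rcm_rollout_loss(navigator, critic, env, instruction, beta=0.5, gamma=0.95):
    """One RL rollout whose reward blends the extrinsic navigation reward
    with an intrinsic cross-modal matching score from the critic."""
    log_probs, ext_rewards, trajectory = [], [], []
    obs, done = env.reset(instruction), False
    while not done:
        dist = navigator(obs, instruction)      # action distribution
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, done = env.step(action)
        ext_rewards.append(reward)
        trajectory.append(obs)

    # Intrinsic reward: how well the executed trajectory "explains" the
    # instruction, spread uniformly over the episode in this sketch.
    intrinsic = critic.score(instruction, trajectory) / len(ext_rewards)
    rewards = [r + beta * intrinsic for r in ext_rewards]

    # Discounted returns, then a REINFORCE-style policy-gradient loss.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return -(torch.stack(log_probs) * returns).sum()


def sil_loss(navigator, buffer, instruction):
    """Self-supervised imitation: behavior-clone the agent's own best
    stored trajectory for this instruction (its past good decisions)."""
    obs_seq, action_seq = buffer.best(instruction)
    nll = 0.0
    for obs, action in zip(obs_seq, action_seq):
        nll = nll - navigator(obs, instruction).log_prob(action)
    return nll
```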

TASK                         DATASET        MODEL                             METRIC   VALUE   GLOBAL RANK
Vision-Language Navigation   Room2Room      RCM + SIL                         SPL      0.59    #2
Visual Navigation            Room-to-Room   RCM+SIL (no early exploration)    SPL      0.38    #2
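Both rows report SPL (Success weighted by Path Length, Anderson et al., 2018), which discounts each success by how much longer the agent's path was than the shortest path to the goal. A minimal NumPy sketch of the standard formula:

```python
import numpy as np

def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length (Anderson et al., 2018).

    successes        -- 1 if episode i reached the goal, else 0
    shortest_lengths -- shortest-path distance from start to goal
    path_lengths     -- length of the path the agent actually took
    """
    successes = np.asarray(successes, dtype=float)
    shortest_lengths = np.asarray(shortest_lengths, dtype=float)
    path_lengths = np.asarray(path_lengths, dtype=float)
    return float(np.mean(
        successes * shortest_lengths / np.maximum(path_lengths, shortest_lengths)
    ))

# Example: 2 of 3 episodes succeed; the detour in episode 2 halves its credit.
print(spl([1, 1, 0], [10.0, 8.0, 12.0], [10.0, 16.0, 20.0]))  # 0.5
```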
