Vision-Language Navigation
31 papers with code • 1 benchmark • 7 datasets
Vision-language navigation (VLN) is the task of steering an embodied agent through real 3D environments to carry out natural language instructions.
(Image credit: Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout)
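To make the task setup concrete, here is a minimal sketch of a VLN episode loop. The `Env` and `Agent` classes below are hypothetical stand-ins for a photo-realistic simulator and a navigation policy, not any particular paper's implementation: the agent receives an instruction plus the current observation and picks discrete actions until it decides to stop.

```python
# Minimal vision-language navigation episode loop (illustrative only).
# `Env` and `Agent` are hypothetical stand-ins for a 3D simulator
# and a learned navigation policy.

STOP = "stop"

class Env:
    """Toy environment: a chain of three viewpoints."""
    def __init__(self):
        self.position = 0

    def observation(self):
        # A real simulator returns panoramic RGB; here, just a viewpoint id.
        return {"viewpoint": self.position}

    def step(self, action):
        if action == "forward":
            self.position += 1
        return self.observation()

class Agent:
    """Toy policy: walk forward twice, then stop."""
    def act(self, instruction, obs):
        return "forward" if obs["viewpoint"] < 2 else STOP

def run_episode(env, agent, instruction, max_steps=20):
    obs = env.observation()
    for _ in range(max_steps):
        action = agent.act(instruction, obs)
        if action == STOP:
            break
        obs = env.step(action)
    return obs

env, agent = Env(), Agent()
final = run_episode(env, agent, "Walk down the hallway and stop at the door.")
print(final)  # {'viewpoint': 2}
```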
Latest papers
CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations
Empirically, on the Room-Across-Room dataset, we show that our multilingual agent achieves large improvements in all metrics over the strong baseline model when generalizing to unseen environments, owing to the cross-lingual language representation and the environment-agnostic visual representation.
Reinforced Structured State-Evolution for Vision-Language Navigation
However, crucial navigation clues for the embodied navigation task (i.e., the object-level environment layout) are discarded, since the maintained state vector is essentially unstructured.
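As a rough illustration of the structured-state idea this snippet motivates, an agent can maintain an object-level memory rather than a single flat vector. The sketch below is a hypothetical simplification, not the paper's actual Structured state-Evolution model:

```python
# Sketch of an object-level, structured state memory (hypothetical
# simplification; not the paper's actual model).
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    label: str          # e.g., "sofa"
    viewpoint: int      # viewpoint where the object was observed
    heading: float      # relative direction in radians

@dataclass
class StructuredState:
    nodes: list = field(default_factory=list)

    def update(self, detections, viewpoint):
        """Accumulate detected objects instead of overwriting a flat vector."""
        for label, heading in detections:
            self.nodes.append(ObjectNode(label, viewpoint, heading))

    def objects_near(self, viewpoint):
        return [n.label for n in self.nodes if n.viewpoint == viewpoint]

state = StructuredState()
state.update([("sofa", 0.3), ("lamp", -1.2)], viewpoint=0)
state.update([("door", 0.0)], viewpoint=1)
print(state.objects_near(1))  # ['door']
```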
Counterfactual Cycle-Consistent Learning for Instruction Following and Generation in Vision-Language Navigation
Since the rise of vision-language navigation (VLN), great progress has been made in instruction following -- building a follower to navigate environments under the guidance of instructions.
Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration
To improve the ability of fast cross-domain adaptation, we propose Prompt-based Environmental Self-exploration (ProbES), which can self-explore environments by sampling trajectories and automatically generate structured instructions via a large-scale cross-modal pretrained model (CLIP).
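A hedged sketch of the sample-then-label idea: match each sampled view against candidate landmark phrases with CLIP, then fill a text template. The CLIP calls follow the openai/CLIP package API; the landmark list, template, and `best_landmark` helper are simplified assumptions, not the ProbES pipeline.

```python
# Sketch: label a sampled trajectory with a templated instruction by
# scoring each view against landmark phrases with CLIP.
# Hypothetical simplification, not the ProbES implementation.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

landmarks = ["a kitchen", "a staircase", "a bedroom", "a hallway"]
text_tokens = clip.tokenize(landmarks).to(device)

def best_landmark(image_path):
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(text_tokens)
        # Normalize so the dot product is cosine similarity.
        image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        sims = image_feat @ text_feat.T
    return landmarks[sims.argmax().item()]

def generate_instruction(view_paths):
    # One landmark per sampled view, joined into a templated instruction.
    steps = [f"walk past {best_landmark(p)}" for p in view_paths]
    return ", then ".join(steps) + "."

# Usage (requires image files for the sampled views):
# print(generate_instruction(["view0.jpg", "view1.jpg"]))
```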
A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility
To study VLN with unknown command feasibility, we introduce a new dataset, Mobile app Tasks with Iterative Feedback (MoTIF), where the goal is to complete a natural language command in a mobile app.
Contrastive Instruction-Trajectory Learning for Vision-Language Navigation
The vision-language navigation (VLN) task requires an agent to reach a target under the guidance of a natural language instruction.
Adversarial Reinforced Instruction Attacker for Robust Vision-Language Navigation
Specifically, we propose a Dynamic Reinforced Instruction Attacker (DR-Attacker), which learns to mislead the navigator to move to the wrong target by destroying the most instructive information in instructions at different timesteps.
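A toy sketch of the word-perturbation idea: mask the token judged most instructive to the navigator. The saliency heuristic below is a hypothetical stand-in; the actual DR-Attacker learns which words to attack via reinforcement learning.

```python
# Toy instruction attack: mask the most instructive token.
# The hand-written saliency heuristic is a hypothetical stand-in
# for DR-Attacker's learned scorer.
LANDMARKS = {"door", "stairs", "kitchen", "sofa", "hallway"}
DIRECTIONS = {"left", "right", "forward", "straight"}

def saliency(token):
    # Assumption: landmarks and directions carry the navigation signal.
    t = token.lower().strip(".,")
    if t in LANDMARKS:
        return 2.0
    if t in DIRECTIONS:
        return 1.0
    return 0.0

def attack(instruction):
    tokens = instruction.split()
    scores = [saliency(t) for t in tokens]
    target = max(range(len(tokens)), key=lambda i: scores[i])
    tokens[target] = "[MASK]"  # destroy the most instructive word
    return " ".join(tokens)

print(attack("Turn left at the stairs and stop by the door."))
# Turn left at the [MASK] and stop by the door.
```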
Vision-Language Navigation with Random Environmental Mixup
Then, we cross-connect the key views of different scenes to construct augmented scenes.
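A simplified sketch of cross-connecting scenes: trajectories are modeled as plain lists of viewpoint ids and spliced at randomly chosen key views. This is a hypothetical simplification of the paper's scene splicing, for illustration only.

```python
# Sketch of Random Environmental Mixup: splice the prefix of a
# trajectory in scene A onto the suffix of a trajectory in scene B
# at randomly chosen "key views". Hypothetical simplification.
import random

def mixup_trajectories(traj_a, traj_b, rng=random):
    # Pick a key view in each trajectory (excluding the start).
    cut_a = rng.randrange(1, len(traj_a))
    cut_b = rng.randrange(1, len(traj_b))
    # Cross-connect: prefix from scene A, suffix from scene B.
    return traj_a[:cut_a] + traj_b[cut_b:]

traj_a = ["A0", "A1", "A2", "A3"]  # viewpoints in scene A
traj_b = ["B0", "B1", "B2", "B3"]  # viewpoints in scene B
random.seed(0)
print(mixup_trajectories(traj_a, traj_b))  # e.g. ['A0', 'A1', 'B3']
```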
Improving Cross-Modal Alignment in Vision Language Navigation via Syntactic Information
One key challenge in this task is to ground instructions with the current visual information that the agent perceives.
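As an illustration of how syntactic information can help with grounding, the sketch below pulls landmark nouns and action verbs out of a dependency parse with spaCy. The part-of-speech heuristic is an assumption for illustration, not the paper's alignment module.

```python
# Sketch: extract landmark nouns and action verbs from an instruction
# via a spaCy parse, so visual grounding can attend to them.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def syntactic_cues(instruction):
    doc = nlp(instruction)
    landmarks = [t.text for t in doc if t.pos_ == "NOUN"]
    actions = [t.lemma_ for t in doc if t.pos_ == "VERB"]
    return landmarks, actions

landmarks, actions = syntactic_cues(
    "Walk past the sofa, turn left at the stairs, and stop at the door.")
print(landmarks)  # e.g. ['sofa', 'stairs', 'door']
print(actions)    # e.g. ['walk', 'turn', 'stop']
```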
The Road to Know-Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation
Vision-and-Language Navigation (VLN) requires an agent to find a path to a remote location on the basis of natural-language instructions and a set of photo-realistic panoramas.