Vision-Language Navigation
31 papers with code • 1 benchmark • 7 datasets
Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments.
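The task is usually framed as a sequential decision loop: at each step the agent receives an observation of the environment together with the instruction and emits a navigation action, including an explicit "stop" action. A minimal sketch of that loop, using a toy one-dimensional environment and a hypothetical action set (real benchmarks return RGB panoramas and use simulator-specific APIs):

```python
# Hypothetical discrete action set; real VLN benchmarks define their own.
ACTIONS = ["forward", "stop"]

class ToyEnv:
    """Stand-in environment: a corridor of `length` cells; 'forward' advances."""
    def __init__(self, length=3):
        self.length = length
        self.pos = 0

    def observe(self):
        # A real benchmark would return visual observations; here, just the position.
        return {"position": self.pos}

    def step(self, action):
        if action == "forward":
            self.pos = min(self.pos + 1, self.length)
        done = action == "stop"
        success = done and self.pos == self.length
        return self.observe(), done, success

def agent_policy(instruction, obs):
    """Toy policy: walk forward until reaching the cell named in the instruction."""
    goal = int(instruction.split()[-1])
    return "forward" if obs["position"] < goal else "stop"

def run_episode(instruction, env):
    obs, done, success = env.observe(), False, False
    while not done:
        obs, done, success = env.step(agent_policy(instruction, obs))
    return success

print(run_episode("walk forward to cell 3", ToyEnv(length=3)))  # True
```

A learned agent replaces `agent_policy` with a model conditioned on the instruction and visual history; episode success is typically judged by whether the agent stops near the goal.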
(Image credit: Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout)
Latest papers with no code
On the Importance of Karaka Framework in Multi-modal Grounding
The Computational Paninian Grammar model helps decode a natural language expression as a series of modifier-modified relations, and therefore facilitates identifying dependency relations closer to the language's contextual semantics than the usual Stanford dependency relations.
Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration
To improve the ability of fast cross-domain adaptation, we propose Prompt-based Environmental Self-exploration (ProbES), which can self-explore environments by sampling trajectories and automatically generate structured instructions via a large-scale cross-modal pretrained model (CLIP).
Vision-Language Navigation: A Survey and Taxonomy
This paper provides a comprehensive survey and an insightful taxonomy of these tasks based on the different characteristics of language instructions in these tasks.
Modular Graph Attention Network for Complex Visual Relational Reasoning
Moreover, to capture the complex logic in a query, we construct a relational graph to represent the visual objects and their relationships, and propose a multi-step reasoning method to progressively understand the complex logic.
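One common reading of "multi-step reasoning over a relational graph" is repeated attention-weighted message passing among object nodes, with attention biased by the query. A minimal numpy sketch under that reading (the dimensions, scoring rule, and update rule are illustrative, not the paper's architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def reasoning_step(nodes, adj, query):
    """One message-passing step: each node attends to its neighbours,
    with attention scores biased by the query vector."""
    scores = nodes @ query                                 # (N,) per-node relevance
    # Mask out non-edges, then normalise over each node's neighbourhood.
    attn = softmax(np.where(adj > 0, scores[None, :], -1e9), axis=1)  # (N, N)
    messages = attn @ nodes                                # aggregate neighbour features
    return 0.5 * nodes + 0.5 * messages                    # residual-style update

rng = np.random.default_rng(0)
N, D = 4, 8
nodes = rng.normal(size=(N, D))     # visual object features
adj = np.ones((N, N)) - np.eye(N)   # fully connected relation graph (no self-loops)
query = rng.normal(size=D)          # encoded query representation

# "Multi-step" reasoning: iterate the update a few times.
for _ in range(3):
    nodes = reasoning_step(nodes, adj, query)
print(nodes.shape)  # (4, 8)
```

Iterating the step lets query-relevant information propagate along multi-hop paths in the graph, which is the intuition behind progressively understanding the complex logic of a query.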
Learning to Stop: A Simple yet Effective Approach to Urban Vision-Language Navigation
Vision-and-Language Navigation (VLN) is a natural language grounding task where an agent learns to follow language instructions and navigate to specified destinations in real-world environments.
Vision-Language Navigation with Self-Supervised Auxiliary Reasoning Tasks
In this paper, we introduce Auxiliary Reasoning Navigation (AuxRN), a framework with four self-supervised auxiliary reasoning tasks to take advantage of the additional training signals derived from the semantic information.
Generalized Natural Language Grounded Navigation via Environment-agnostic Multitask Learning
Recent research efforts enable the study of natural language grounded navigation in photo-realistic environments, e.g., following natural language instructions or dialog.
Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation
Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments.