Vision-Language Navigation
31 papers with code • 1 benchmark • 7 datasets
Vision-language navigation (VLN) is the task of directing an embodied agent to carry out natural language instructions inside real 3D environments.
(Image credit: Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout)
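For concreteness, the task can be pictured as the episode loop sketched below. Every interface here (env.reset, env.step, agent.act) is a hypothetical placeholder rather than the API of any particular VLN benchmark.

```python
# Minimal sketch of a VLN episode loop; the env/agent interfaces are
# hypothetical placeholders, not any benchmark's actual API.
def run_episode(env, agent, instruction, max_steps=100):
    obs = env.reset()                         # initial panoramic observation
    for _ in range(max_steps):
        action = agent.act(obs, instruction)  # e.g., pick a neighboring viewpoint, or STOP
        obs, done = env.step(action)
        if done:                              # agent issued STOP or the episode ended
            return obs
    return obs
```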
Latest papers
Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation
Vision-and-language navigation (VLN) requires an agent to navigate to a remote location in a 3D environment by following a natural language instruction.
Volumetric Environment Representation for Vision-Language Navigation
To achieve a comprehensive 3D representation with fine-grained details, we introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
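The voxelization step can be illustrated with the generic sketch below: per-point features observed by the agent are mean-pooled into a fixed 3D grid of cells. This is a minimal sketch assuming point-wise features are already available, not the paper's VER implementation, and all parameter names (origin, cell_size, grid_size) are illustrative.

```python
# Generic voxelization sketch (not the paper's VER implementation):
# bucket observed 3D points and their features into a fixed grid of cells.
import numpy as np

def voxelize(points, feats, origin, cell_size, grid_size):
    """points: (N, 3) world coordinates; feats: (N, C) per-point features."""
    idx = np.floor((points - origin) / cell_size).astype(int)  # cell index per point
    valid = np.all((idx >= 0) & (idx < grid_size), axis=1)     # drop points outside the grid
    idx, feats = idx[valid], feats[valid]
    flat = np.ravel_multi_index(idx.T, grid_size)              # linear cell id
    n_cells, C = np.prod(grid_size), feats.shape[1]
    grid, counts = np.zeros((n_cells, C)), np.zeros(n_cells)
    np.add.at(grid, flat, feats)                               # sum features per cell
    np.add.at(counts, flat, 1)
    occ = counts > 0
    grid[occ] /= counts[occ][:, None]                          # mean-pool occupied cells
    return grid.reshape(*grid_size, C)
```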
Planning as In-Painting: A Diffusion-Based Embodied Task Planning Framework for Environments under Uncertainty
In this paper, we aim to tackle this problem with a unified framework consisting of an end-to-end trainable method and a planning algorithm.
An Embodied Generalist Agent in 3D World
Leveraging massive knowledge and learning schemes from large language models (LLMs), recent machine learning models have shown notable success in building generalist agents capable of general-purpose task solving across diverse domains, including natural language processing, computer vision, and robotics.
Bird's-Eye-View Scene Graph for Vision-Language Navigation
Vision-language navigation (VLN), which requires an agent to navigate 3D environments following human instructions, has seen great advances.
ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments
To develop a robust VLN-CE agent, we propose a new navigation framework, ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability to perform obstacle-avoiding control in continuous environments.
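The first of these skills amounts to maintaining a sparse topological map and searching over it for a long-range route. The sketch below is a generic version of that idea (plain Dijkstra over a waypoint graph; node ids and edge costs are assumptions), not the authors' code.

```python
# Generic topological-map sketch in the spirit of graph-based VLN planners:
# keep a graph of visited/frontier waypoints and plan long-range routes
# over it with Dijkstra. Not the ETPNav implementation.
import heapq
from collections import defaultdict

class TopoMap:
    def __init__(self):
        self.edges = defaultdict(list)            # node -> [(neighbor, cost)]

    def add_edge(self, a, b, cost):
        self.edges[a].append((b, cost))
        self.edges[b].append((a, cost))

    def shortest_path(self, start, goal):
        dist, prev = {start: 0.0}, {}
        pq = [(0.0, start)]
        while pq:
            d, u = heapq.heappop(pq)
            if u == goal:
                break
            if d > dist.get(u, float("inf")):
                continue                          # stale queue entry
            for v, w in self.edges[u]:
                nd = d + w
                if nd < dist.get(v, float("inf")):
                    dist[v], prev[v] = nd, u
                    heapq.heappush(pq, (nd, v))
        if goal != start and goal not in prev:
            return None                           # goal unreachable in the current map
        path = [goal]
        while path[-1] != start:
            path.append(prev[path[-1]])
        return path[::-1]
```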
Adaptive Zone-Aware Hierarchical Planner for Vision-Language Navigation
In this paper, we propose an Adaptive Zone-aware Hierarchical Planner (AZHP) that explicitly divides the navigation process into two heterogeneous phases, i.e., sub-goal setting via zone partition/selection (high-level action) and sub-goal execution (low-level action), for hierarchical planning.
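The two phases map naturally onto a two-level control loop, sketched below. Every interface here (select_zone, reached, act) is a hypothetical placeholder standing in for the paper's learned policies.

```python
# Illustrative two-level navigation loop in the spirit of a hierarchical
# planner: a high-level policy picks a zone and sub-goal, a low-level
# policy executes primitive actions until the sub-goal is reached.
# All interfaces are hypothetical, not AZHP's actual model.
def hierarchical_navigate(env, high_policy, low_policy, instruction, max_steps=200):
    obs, steps = env.reset(), 0
    while steps < max_steps:
        zone, sub_goal = high_policy.select_zone(obs, instruction)  # high-level action
        while steps < max_steps and not low_policy.reached(obs, sub_goal):
            obs, done = env.step(low_policy.act(obs, sub_goal))     # low-level action
            steps += 1
            if done:                                                # episode ended
                return obs
    return obs
```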
Towards Versatile Embodied Navigation
With the emergence of varied visual navigation tasks (e.g., image-/object-/audio-goal and vision-language navigation) that specify the target in different ways, the community has made appealing advances in training specialized agents capable of handling individual navigation tasks well.
DANLI: Deliberative Agent for Following Natural Language Instructions
Purely reactive agents, which map observations directly to actions, are insufficient for long-horizon complex tasks.
Target-Driven Structured Transformer Planner for Vision-Language Navigation
Vision-language navigation is the task of directing an embodied agent to navigate in 3D scenes with natural language instructions.