Vision-Language Navigation
28 papers with code • 1 benchmark • 7 datasets
Vision-language navigation (VLN) is the task of guiding an embodied agent to carry out natural-language instructions inside real 3D environments.
(Image credit: Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout)
Latest papers with no code
Volumetric Environment Representation for Vision-Language Navigation
To achieve a comprehensive 3D representation with fine-grained details, we introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
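The paper's code is not listed here, but the core idea of voxelizing the physical world into structured 3D cells can be illustrated with a minimal NumPy sketch (my own illustration, not the authors' VER model; the function name, bounds convention, and occupancy-only cells are assumptions):

```python
import numpy as np

def voxelize(points, bounds, resolution):
    """Map 3D points into a structured grid of cells (occupancy only).

    points:     (N, 3) array of xyz coordinates
    bounds:     ((xmin, ymin, zmin), (xmax, ymax, zmax)) of the scene
    resolution: number of cells along each axis
    Returns a boolean occupancy grid of shape (resolution,)*3.
    """
    lo = np.asarray(bounds[0], dtype=float)
    hi = np.asarray(bounds[1], dtype=float)
    # Normalize points into [0, 1) over the scene bounds, then scale
    # to integer cell indices.
    idx = ((points - lo) / (hi - lo) * resolution).astype(int)
    # Discard points that fall outside the bounds.
    mask = np.all((idx >= 0) & (idx < resolution), axis=1)
    grid = np.zeros((resolution,) * 3, dtype=bool)
    grid[tuple(idx[mask].T)] = True
    return grid

# Example: three points in a 2 m cube, discretized into a 4x4x4 grid.
pts = np.array([[0.1, 0.1, 0.1], [1.9, 1.9, 1.9], [1.0, 0.5, 1.5]])
grid = voxelize(pts, ((0, 0, 0), (2, 2, 2)), 4)
print(grid.sum())  # prints 3: each point lands in a distinct cell
```

A full VER would store learned features per cell rather than a single occupancy bit, but the indexing from continuous coordinates to structured cells is the same.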
TINA: Think, Interaction, and Action Framework for Zero-Shot Vision Language Navigation
We evaluate the performance of our method on the Room-to-Room dataset.
Vision-Language Navigation with Embodied Intelligence: A Survey
As a long-term vision in the field of artificial intelligence, the core goal of embodied intelligence is to improve agents' abilities to perceive, understand, and interact with their environment.
What Is Near?: Room Locality Learning for Enhanced Robot Vision-Language-Navigation in Indoor Living Environments
We show that local-global planning based on locality knowledge and predicting the indoor layout allows the agent to efficiently select the appropriate action.
DREAMWALKER: Mental Planning for Continuous Vision-Language Navigation
VLN-CE is a recently released embodied task, where AI agents need to navigate a freely traversable environment to reach a distant target location, given language instructions.
Bird's-Eye-View Scene Graph for Vision-Language Navigation
Vision-language navigation (VLN), which requires an agent to navigate 3D environments following human instructions, has seen great advances.
Active Semantic Localization with Graph Neural Embedding
Semantic localization, i.e., robot self-localization with semantic image modality, is critical in recently emerging embodied AI applications (e.g., point-goal navigation, object-goal navigation, vision-language navigation) and topological mapping applications (e.g., graph neural SLAM, ego-centric topological maps).
Accessible Instruction-Following Agent
To improve interactability, we connect our agent with a large language model that reports the situation and current state to the user and also explains the action decisions.
Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation
Vision-Language Navigation (VLN) is a challenging task which requires an agent to align complex visual observations to language instructions to reach the goal position.
A survey on knowledge-enhanced multimodal learning
Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation.