Vision-Language Navigation

28 papers with code • 1 benchmark • 7 datasets

Vision-language navigation (VLN) is the task in which an embodied agent follows natural-language instructions to navigate real 3D environments.

(Image credit: Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout)
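To make the task setup concrete, here is a minimal sketch of a single VLN episode loop in Python. The `env` and `agent` objects, their method names, and the discrete action set are illustrative assumptions, not any specific benchmark's API.

```python
from dataclasses import dataclass
from typing import Any

# A discrete low-level action space of the kind commonly used in VLN settings.
ACTIONS = ("move_forward", "turn_left", "turn_right", "stop")

@dataclass
class Observation:
    rgb: Any          # egocentric (or panoramic) image at the current viewpoint
    heading: float    # agent heading in radians

def run_episode(env, agent, instruction: str, max_steps: int = 50) -> bool:
    """Roll out one instruction-following episode.

    `env` and `agent` are placeholders: the env exposes reset/step/reached_goal,
    and the agent maps (instruction, observation) -> action string.
    """
    obs = env.reset(instruction)
    for _ in range(max_steps):
        action = agent.act(instruction, obs)
        if action == "stop":
            break
        obs = env.step(action)
    return env.reached_goal()  # e.g. the agent stopped within 3 m of the goal
```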

Latest papers with no code

Volumetric Environment Representation for Vision-Language Navigation

no code yet • 21 Mar 2024

To achieve a comprehensive 3D representation with fine-grained details, we introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
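The excerpt above gives only the high-level idea; below is a minimal NumPy sketch of the generic operation it rests on, binning 3D points into a regular voxel grid. This is not the paper's VER model (which builds learned features per cell); the function name and parameters are illustrative.

```python
import numpy as np

def voxelize(points: np.ndarray, origin: np.ndarray, cell_size: float,
             grid_shape: tuple[int, int, int]) -> np.ndarray:
    """Count 3D points falling into each cell of a regular voxel grid.

    points: (N, 3) array of xyz coordinates in the world frame.
    origin: (3,) world coordinate of the grid's minimum corner.
    Returns an occupancy-count volume of shape grid_shape.
    """
    idx = np.floor((points - origin) / cell_size).astype(int)
    # Keep only points that land inside the grid bounds.
    valid = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx = idx[valid]
    volume = np.zeros(grid_shape, dtype=np.int32)
    np.add.at(volume, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)
    return volume
```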

TINA: Think, Interaction, and Action Framework for Zero-Shot Vision Language Navigation

no code yet • 13 Mar 2024

We evaluate the performance of our method on the Room-to-Room dataset.

Vision-Language Navigation with Embodied Intelligence: A Survey

no code yet • 22 Feb 2024

As a long-term vision in the field of artificial intelligence, the core goal of embodied intelligence is to improve agents' abilities to perceive, understand, and interact with their environment.

What Is Near?: Room Locality Learning for Enhanced Robot Vision-Language-Navigation in Indoor Living Environments

no code yet • 10 Sep 2023

We show that local-global planning based on locality knowledge and predicting the indoor layout allows the agent to efficiently select the appropriate action.

DREAMWALKER: Mental Planning for Continuous Vision-Language Navigation

no code yet • ICCV 2023

VLN-CE is a recently released embodied task in which AI agents must navigate a freely traversable environment to reach a distant target location, given language instructions.

Bird's-Eye-View Scene Graph for Vision-Language Navigation

no code yet • ICCV 2023

Vision-language navigation (VLN), which requires an agent to navigate 3D environments following human instructions, has shown great advances.

Active Semantic Localization with Graph Neural Embedding

no code yet • 10 May 2023

Semantic localization, i.e., robot self-localization with semantic image modality, is critical in recently emerging embodied AI applications (e.g., point-goal navigation, object-goal navigation, vision-language navigation) and topological mapping applications (e.g., graph neural SLAM, ego-centric topological map).

Accessible Instruction-Following Agent

no code yet • 8 May 2023

To improve interactivity, we connect our agent with a large language model that informs the user of the situation and current state and also explains the action decisions.
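As a rough illustration of this kind of coupling, the sketch below formats the agent's state and chosen action into a prompt for a language model; the `llm` callable, function name, and prompt wording are placeholders, not the paper's actual interface.

```python
def explain_action(llm, instruction: str, state_summary: str, action: str) -> str:
    """Ask a language model (passed in as a plain callable `llm: str -> str`,
    standing in for whatever LLM interface is actually used) to describe the
    current situation and justify the chosen action for the user."""
    prompt = (
        "You assist a user who gave a navigation instruction.\n"
        f"Instruction: {instruction}\n"
        f"Current state: {state_summary}\n"
        f"Chosen action: {action}\n"
        "Briefly describe the situation and explain why this action was chosen."
    )
    return llm(prompt)
```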

Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation

no code yet • 13 Feb 2023

Vision-Language Navigation (VLN) is a challenging task which requires an agent to align complex visual observations to language instructions to reach the goal position.

A survey on knowledge-enhanced multimodal learning

no code yet • 19 Nov 2022

Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation.