Vision and Language Navigation
88 papers with code • 5 benchmarks • 13 datasets
Libraries
Use these libraries to find Vision and Language Navigation models and implementations.

Latest papers
VLN-PETL: Parameter-Efficient Transfer Learning for Vision-and-Language Navigation
Performance on Vision-and-Language Navigation (VLN) tasks has improved rapidly in recent years, thanks to the use of large pre-trained vision-and-language models.
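As a rough sketch of the parameter-efficient transfer learning idea in general (not the paper's actual VLN-PETL modules), the snippet below freezes a pretrained backbone and trains only small bottleneck adapters; all names and dimensions here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic bottleneck adapter: down-project, non-linearity, up-project,
    added residually. Only these small layers are trained; the pretrained
    transformer weights stay frozen."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen backbone's features intact.
        return x + self.up(self.act(self.down(x)))

adapter = BottleneckAdapter(hidden_dim=768)
out = adapter(torch.randn(2, 10, 768))  # (batch, seq_len, hidden_dim)

# In a full model one would freeze the backbone (hypothetical `model`):
# for p in model.backbone.parameters():
#     p.requires_grad = False
```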
AerialVLN: Vision-and-Language Navigation for UAVs
Navigating in the sky is more complicated than navigating on the ground, because agents must account for flying height and reason about more complex spatial relationships.
Scaling Data Generation in Vision-and-Language Navigation
Recent research in language-guided visual navigation has shown that training generalizable agents demands both diverse traversable environments and large amounts of supervision.
Kefa: A Knowledge Enhanced and Fine-grained Aligned Speaker for Navigation Instruction Generation
We introduce Kefa, a novel speaker model for navigation instruction generation.
GridMM: Grid Memory Map for Vision-and-Language Navigation
Vision-and-language navigation (VLN) requires an agent to navigate to a remote location in a 3D environment by following natural language instructions.
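In the generic sense, a grid memory map projects egocentric observations into a top-down grid of cells, each aggregating the visual features observed there. The sketch below shows this bookkeeping under assumed names (`GridMemory`, `update`) and is not the paper's implementation.

```python
import numpy as np

class GridMemory:
    """Minimal top-down grid memory: each cell keeps a running mean of the
    visual features of observations whose world coordinates fall inside it."""
    def __init__(self, size: int = 64, cell_m: float = 0.5, feat_dim: int = 512):
        self.cell_m = cell_m
        self.feats = np.zeros((size, size, feat_dim), dtype=np.float32)
        self.counts = np.zeros((size, size), dtype=np.int32)
        self.origin = size // 2  # agent starts at the grid center

    def update(self, xy_m: np.ndarray, feats: np.ndarray) -> None:
        """xy_m: (N, 2) world coordinates in meters; feats: (N, D) features."""
        ij = (xy_m / self.cell_m).astype(int) + self.origin
        for (i, j), f in zip(ij, feats):
            if 0 <= i < self.feats.shape[0] and 0 <= j < self.feats.shape[1]:
                n = self.counts[i, j]
                # Incremental running mean of features seen in this cell.
                self.feats[i, j] = (self.feats[i, j] * n + f) / (n + 1)
                self.counts[i, j] += 1
```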
Learning Navigational Visual Representations with Semantic Map Supervision
Being able to perceive the semantics and the spatial structure of the environment is essential for visual navigation of a household robot.
Learning Vision-and-Language Navigation from YouTube Videos
In this paper, we propose to learn an agent from house tour videos on YouTube by creating a large-scale dataset of reasonable path-instruction pairs extracted from such videos and pre-training the agent on it.
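To make the data construction concrete, a path-instruction pair might be represented as a record like the one below; the field names are assumptions for illustration, not the paper's schema.

```python
from dataclasses import dataclass

@dataclass
class PathInstructionPair:
    """Illustrative record for a path-instruction pair mined from a house
    tour video (hypothetical fields, not the paper's actual format)."""
    video_id: str          # source YouTube video
    frame_ids: list[int]   # frames sampled along the traversed path
    actions: list[str]     # e.g. ["forward", "turn_left", ...]
    instruction: str       # language instruction paired with the path

pair = PathInstructionPair(
    video_id="abc123",
    frame_ids=[0, 30, 60],
    actions=["forward", "turn_left"],
    instruction="Go down the hallway and turn left at the kitchen.",
)
```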
Behavioral Analysis of Vision-and-Language Navigation Agents
To be successful, Vision-and-Language Navigation (VLN) agents must be able to ground instructions to actions based on their surroundings.
VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View
In this work, we propose VELMA, an embodied LLM agent that uses a verbalization of its trajectory and visual environment observations as a contextual prompt for predicting the next action.
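In its simplest form, the verbalization idea turns the agent's trajectory and observations into text appended to the LLM prompt. The sketch below uses made-up helper names (`verbalize_step`, `build_prompt`) and an invented prompt format; it is not VELMA's actual prompting scheme.

```python
def verbalize_step(step: int, action: str, landmarks: list[str]) -> str:
    """Render one trajectory step as a line of text (illustrative format)."""
    seen = ", ".join(landmarks) if landmarks else "nothing notable"
    return f"Step {step}: I went {action} and saw {seen}."

def build_prompt(instruction: str, history: list[tuple[str, list[str]]]) -> str:
    """Concatenate the instruction with the verbalized history; the LLM is
    then asked to emit the next action as text."""
    lines = [f"Navigation instruction: {instruction}"]
    lines += [verbalize_step(i + 1, a, lm) for i, (a, lm) in enumerate(history)]
    lines.append("Next action (forward/left/right/stop):")
    return "\n".join(lines)

print(build_prompt(
    "Walk past the cafe and stop at the traffic light.",
    [("forward", ["a cafe on the left"]), ("forward", ["a traffic light"])],
))
```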
NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models
Trained on an unprecedented scale of data, large language models (LLMs) such as ChatGPT and GPT-4 exhibit significant emergent reasoning abilities that arise from model scaling.