What question did this study set out to answer?

This research aims to enhance zero-shot vision-and-language navigation by addressing limitations of existing large language models in spatial reasoning and decision-making.

June 12, 2026 | Open Access

STAMP: Spatial-Temporal Anchored Motion Planning for Zero-Shot Continuous Vision-and-Language Navigation

Key Points

The proposed STAMP framework uses a multimodal spatial-temporal anchoring mechanism for navigation.
The framework adopts a hierarchical design that separates high-level semantic reasoning from low-level motion execution.
The authors ran extensive experiments comparing STAMP with existing zero-shot methods on benchmarks and in real-world settings.
STAMP achieved performance comparable to state-of-the-art zero-shot methods on VLN-CE benchmarks.
The anchoring mechanism is designed to reduce perceptual misalignment in complex navigation scenarios.
An online, backtracking-enabled topological map supports robust planning under uncertainty.

Abstract

Vision-and-Language Navigation in continuous environments (VLN-CE) requires embodied agents to ground natural language instructions into reliable long-horizon motion decisions under partial observability. Despite their strong semantic understanding and reasoning abilities, Multimodal Large Language Models (LVLMs) struggle when directly applied to VLN, as they lack explicit spatial grounding, embodied memory, and awareness of geometric and reachability constraints, leading to perceptual misalignment and cascading decision errors in complex scenes. To address these limitations, we propose STAMP, a Spatial-Temporal Anchored Motion Planning framework for zero-shot VLN-CE, which systematically bridges the gap between pretrained world knowledge and embodied navigation. STAMP adopts a hierarchical design that decouples high-level semantic reasoning from low-level motion execution, enabling a frozen LVLM to operate over a structured, navigation-oriented abstraction. Its core novelty lies in a multimodal spatial-temporal anchoring mechanism that explicitly encodes instruction-relevant landmarks, action semantics, depth-aware geometry, and historical navigation context, together with an explicit Chain-of-Navigation reasoning process that constrains decision-making to navigation-critical cues. Furthermore, STAMP incrementally constructs an online, backtracking-enabled topological map, supporting robust planning under uncertainty. Extensive experiments demonstrate the effectiveness of the proposed STAMP framework, achieving performance comparable to state-of-the-art zero-shot methods on VLN-CE benchmarks and in real-world settings.
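
The abstract mentions an online, backtracking-enabled topological map but gives no implementation details. The following is only a minimal illustrative sketch of that general idea, not the authors' method: the class `TopoMap`, its methods, and the `choose` callback (a stand-in for the high-level reasoner) are all hypothetical names introduced here, and the one-step backtracking rule is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class TopoMap:
    """Hypothetical online topological map (illustrative, not from the paper)."""
    edges: dict = field(default_factory=dict)  # node id -> set of neighbor ids
    visited: set = field(default_factory=set)  # nodes the agent has stood on
    path: list = field(default_factory=list)   # stack of nodes used for backtracking

    def observe(self, node, neighbors):
        """Add the current node and the candidate waypoints seen from it."""
        self.edges.setdefault(node, set())
        for n in neighbors:
            self.edges[node].add(n)
            self.edges.setdefault(n, set()).add(node)
        self.visited.add(node)
        # Only push when the agent actually moved to a different node.
        if not self.path or self.path[-1] != node:
            self.path.append(node)

    def frontier(self, node):
        """Unvisited neighbors of `node`, sorted so results are deterministic."""
        return sorted(n for n in self.edges.get(node, ()) if n not in self.visited)

    def next_step(self, choose):
        """Pick an unvisited neighbor, or backtrack one step at a dead end.

        `choose` is any callable that selects one item from a list of
        candidates; it is an assumed placeholder for the semantic reasoner.
        Returns None when nothing is left to explore.
        """
        if not self.path:
            return None
        candidates = self.frontier(self.path[-1])
        if candidates:
            return choose(candidates)
        if len(self.path) > 1:
            self.path.pop()          # dead end: step back along the path
            return self.path[-1]
        return None


# Usage: explore a tiny graph A-B, A-C, always taking the first candidate.
topo = TopoMap()
topo.observe("A", ["B", "C"])
first = topo.next_step(lambda c: c[0])   # moves toward an unvisited neighbor
topo.observe(first, [])
back = topo.next_step(lambda c: c[0])    # no frontier at a leaf, so backtrack
```

Keeping the map as a plain adjacency dictionary plus a path stack means it can grow one observation at a time, which is the property the abstract calls incremental construction; a real system would also need metric poses and reachability checks that this sketch omits.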

Cite This Study

Liu et al. studied this question.

synapsesocial.com/papers/6a2ba3fa8101cf8926f0289d https://doi.org/10.3390/s26123698
