Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires embodied agents to ground natural language instructions into reliable long-horizon motion decisions under partial observability. Despite their strong semantic understanding and reasoning abilities, Large Vision-Language Models (LVLMs) struggle when directly applied to VLN, as they lack explicit spatial grounding, embodied memory, and awareness of geometric and reachability constraints, leading to perceptual misalignment and cascading decision errors in complex scenes. To address these limitations, we propose STAMP, a Spatial-Temporal Anchored Motion Planning framework for zero-shot VLN-CE, which systematically bridges the gap between pretrained world knowledge and embodied navigation. STAMP adopts a hierarchical design that decouples high-level semantic reasoning from low-level motion execution, enabling a frozen LVLM to operate over a structured, navigation-oriented abstraction. Its core novelty lies in a multimodal spatial-temporal anchoring mechanism that explicitly encodes instruction-relevant landmarks, action semantics, depth-aware geometry, and historical navigation context, together with an explicit Chain-of-Navigation reasoning process that constrains decision-making to navigation-critical cues. Furthermore, STAMP incrementally constructs an online, backtracking-enabled topological map, supporting robust planning under uncertainty. Extensive experiments demonstrate the effectiveness of STAMP, which achieves performance comparable to state-of-the-art zero-shot methods on VLN-CE benchmarks and in real-world settings.
Liu et al. (Wed,) studied this question.