Vision-and-language navigation (VLN) traditionally relies on explicit reasoning chains, which are interpretable but severely constrain inference efficiency and scalability in long-range environments. Existing multimodal large language models (MLLMs) frequently hit latency bottlenecks because they generate verbose textual narratives during decision-making. To address these limitations, we propose spatial reasoning vision-and-language navigation (SR-VLN), a novel framework that shifts the paradigm from explicit chain-of-thought (CoT) reasoning to an implicit spatial representation space. SR-VLN introduces a pyramidal hierarchical history framework integrated with perceptual compression to condense historical trajectories into multi-scale representations, minimizing token overhead while preserving critical spatial semantics. Rather than generating textual reasoning steps, SR-VLN employs compact, learnable spatial tokens (S-Tokens) to perform inference directly within the latent feature space. To establish robust causal mappings between these implicit states and navigational actions, we employ a hybrid training strategy that combines sparse reward supervision with reinforcement learning via Group Relative Policy Optimization (GRPO). Extensive evaluations on the R2R, REVERIE, and SOON datasets demonstrate that SR-VLN achieves state-of-the-art overall navigation performance while balancing accuracy and efficiency. Compared to explicit reasoning baselines, our method reduces token consumption by 68% and achieves a 4.1× inference speedup while reaching a 76.02% success rate (SR) and a 73.80% success weighted by path length (SPL) on the R2R unseen split, thereby facilitating near-real-time action prediction in long-range navigation environments.
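The reinforcement learning stage mentioned above can be illustrated with the group-relative advantage that GRPO is generally built on: each sampled rollout's reward is normalized against the mean and standard deviation of its own sampling group, so no learned value function is needed. The following is a minimal sketch under stated assumptions, not SR-VLN's reward design or training code. The function name `group_relative_advantages`, the binary success reward, the group of four rollouts, and the use of the population standard deviation are all hypothetical choices made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.

    Assumption: population standard deviation is used, with a small
    eps to avoid division by zero when all rewards in a group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four rollouts sampled for the same instruction,
# with a sparse reward of 1.0 if the agent reached the goal, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every rollout succeeds carries no learning signal:
# all advantages collapse to zero.
uniform = group_relative_advantages([1.0, 1.0, 1.0])
```

In this sketch, successful rollouts receive positive advantages and failed ones receive negative advantages of equal magnitude, which is what lets a sparse success signal still rank trajectories within a group.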
Zhu et al. (Mon,) studied this question.