Current Vision-Language-Action (VLA) models for embodied navigation are effectively stateless, which limits long-horizon reasoning in complex indoor environments. We present HRM-VLA, a hierarchical architecture for Vision-Language Navigation in Continuous Environments (VLN-CE) that introduces persistent, multi-scale spatial memory. The system combines a DINOv2 vision encoder and a Qwen2.5 language encoder with an Adaptive Hierarchical State-Space Memory module operating at three temporal scales (Object, Furniture, Room), with learned halting for adaptive computation depth. A cross-modal VLA decoder fuses spatial memory with language via bidirectional cross-attention. ALiBi and RoPE modules are architecturally present but not fully activated in the current forward path. We train with a stabilized DAgger pipeline that includes oracle-supervised rollouts, replay buffering, class balancing, and collapse monitoring. On R2R-VLN-CE, the full model achieves a success rate (SR) of 15.0% on val_seen and 1.5% on val_unseen. Ablating spatial memory reduces SR to 1.0% and 0.0%, respectively. Two findings define directions for future work: an unexpected modality collapse, in which the agent performs better without language instructions, and the substantial generalization gap between seen and unseen environments.
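The abstract does not specify how the learned halting is implemented. As a purely illustrative sketch, the following assumes an ACT-style scheme in which each refinement step emits a halting probability, computation stops once the cumulative probability reaches 1 - eps, and the final step receives the remaining weight. The function names (`halting_weights`, `adaptive_update`) and the `eps` parameter are hypothetical and are not taken from HRM-VLA.

```python
def halting_weights(halt_probs, eps=0.01):
    """Per-step mixing weights under ACT-style halting (assumed scheme).

    halt_probs: halting probabilities in [0, 1], one per refinement step.
    Stops at the first step where the cumulative probability reaches
    1 - eps, or at the last available step; that step gets the remainder,
    so the returned weights always sum to 1.
    """
    weights = []
    cumulative = 0.0
    for i, p in enumerate(halt_probs):
        is_last = i == len(halt_probs) - 1
        if cumulative + p >= 1.0 - eps or is_last:
            weights.append(1.0 - cumulative)  # remainder goes to final step
            break
        weights.append(p)
        cumulative += p
    return weights


def adaptive_update(states, halt_probs, eps=0.01):
    """Mix per-step memory states with the halting weights.

    states: list of equal-length float vectors, one per refinement step.
    Returns (mixed_state, number_of_steps_actually_used).
    """
    weights = halting_weights(halt_probs, eps)
    mixed = [0.0] * len(states[0])
    for w, state in zip(weights, states):
        for j, value in enumerate(state):
            mixed[j] += w * value
    return mixed, len(weights)
```

Under this assumption, a confident early halting probability cuts the computation short: with `halt_probs = [0.2, 0.9, 0.5]` the third step is never used, because the cumulative probability already exceeds the threshold at the second step.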
Pranav Wagh (Wed,) studied this question.