What question did this study set out to answer?

Evaluate the effectiveness of a hierarchical architecture in enhancing vision-language navigation capabilities in complex environments.

June 12, 2026 | Open Access

Adaptive Hierarchical State-Space Models for Robust Vision-Language Navigation

Key Points

Evaluate the effectiveness of a hierarchical architecture in enhancing vision-language navigation capabilities in complex environments.
Developed HRM-VLA model combining vision and language encoders with a spatial memory module at multiple levels.
Trained using a specialized pipeline including oracle-supervised rollouts and replay buffering on the R2R-VLN-CE dataset.
Conducted ablation studies to assess the impact of spatial memory on navigation performance.
Full model achieved 15.0% success rate on val_seen and 1.5% on val_unseen in vision-language navigation.
Ablation of spatial memory reduced success rates to 1.0% on val_seen and 0.0% on val_unseen.
Unexpectedly, the agent performed better without language instructions (a modality collapse), and a substantial gap remains between val_seen and val_unseen performance.
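To make the reported figures concrete, here is a minimal sketch of how a success rate (SR) and the seen/unseen gap could be computed. It assumes the common R2R convention that an episode counts as a success when the agent stops within 3 m of the goal; the function names and the threshold default are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def success_rate(final_distances_m: Sequence[float], threshold_m: float = 3.0) -> float:
    """Fraction of episodes that end within `threshold_m` metres of the goal.

    The 3.0 m default follows the usual R2R convention and is an assumption here.
    """
    if not final_distances_m:
        raise ValueError("need at least one episode")
    successes = sum(1 for d in final_distances_m if d <= threshold_m)
    return successes / len(final_distances_m)


def generalization_gap(sr_seen: float, sr_unseen: float) -> float:
    """Difference between seen and unseen success rates, in the same units as the inputs."""
    return sr_seen - sr_unseen
```

With the figures above, `generalization_gap(15.0, 1.5)` gives a gap of 13.5 percentage points for the full model.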

Abstract

Current Vision-Language-Action (VLA) models for embodied navigation are effectively stateless, limiting long-horizon reasoning in complex indoor environments. We present HRM-VLA, a hierarchical architecture for Vision-Language Navigation in Continuous Environments (VLN-CE) that introduces persistent, multi-scale spatial memory. The system combines a DINOv2 vision encoder and a Qwen2.5 language encoder with an Adaptive Hierarchical State-Space Memory module at three temporal scales (Object, Furniture, Room), with learned halting for adaptive computation depth. A cross-modal VLA decoder fuses spatial memory with language via bidirectional cross-attention. ALiBi and RoPE modules are architecturally present but not fully activated in the current forward path. We train with a stabilized DAgger pipeline including oracle-supervised rollouts, replay buffering, class balancing, and collapse monitoring. On R2R-VLN-CE, the full model achieves 15.0% SR on val_seen and 1.5% on val_unseen. Ablation of spatial memory reduces SR to 1.0% and 0.0%, respectively. An unexpected modality collapse finding, in which the agent performs better without language instructions, and the substantial generalization gap define directions for future work.
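The training recipe named in the abstract (oracle-supervised rollouts, replay buffering, class balancing) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the one-dimensional environment, the `oracle`, `step`, `collect_rollout`, and `class_balanced_batch` helpers, and the `beta` mixing schedule are hypothetical and are not the authors' pipeline. The sketch shows only the general DAgger idea: the learner's own actions decide which states are visited, while every visited state is labeled with the oracle's action.

```python
import random
from collections import Counter, deque


def oracle(state: int) -> str:
    """Toy expert: walk toward position 0, then stop."""
    if state == 0:
        return "STOP"
    return "LEFT" if state > 0 else "RIGHT"


def step(state: int, action: str) -> int:
    """Toy 1-D dynamics; STOP leaves the state unchanged."""
    if action == "LEFT":
        return state - 1
    if action == "RIGHT":
        return state + 1
    return state


def collect_rollout(policy, start_state, horizon, beta, rng):
    """Act with the oracle with probability beta, otherwise with the learner.

    Every visited state is stored with the oracle's label, regardless of
    which policy chose the action that was actually executed.
    """
    samples = []
    state = start_state
    for _ in range(horizon):
        expert_action = oracle(state)
        samples.append((state, expert_action))
        action = expert_action if rng.random() < beta else policy(state)
        state = step(state, action)
    return samples


def class_balanced_batch(buffer, batch_size, rng):
    """Sample with inverse-frequency weights so rare action labels are not drowned out."""
    counts = Counter(action for _, action in buffer)
    weights = [1.0 / counts[action] for _, action in buffer]
    return rng.choices(list(buffer), weights=weights, k=batch_size)


rng = random.Random(0)
replay = deque(maxlen=64)                # bounded replay buffer; oldest samples drop out
always_right = lambda state: "RIGHT"     # stand-in for an untrained learner
for beta in (1.0, 0.5, 0.0):             # shift control from the oracle to the learner
    replay.extend(collect_rollout(always_right, 3, 10, beta, rng))
batch = class_balanced_batch(replay, 8, rng)
```

In a real setting the batch would feed a supervised update of the learner between rollouts; that step, and the collapse monitoring the abstract mentions, are left out of this sketch.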

Cite This Study

Pranav Wagh studied this question.

synapsesocial.com/papers/6a2ba34a8101cf8926f01fe6 https://doi.org/10.17615/s9re-1h34
