Vision-and-language navigation (VLN) has become a central problem in embodied artificial intelligence, drawing on multimodal perception and reasoning. This survey traces the cognitive trajectory of VLN agents, using Piaget’s developmental stages as a framework for analyzing their technical progression. We follow VLN agents from ‘infant’ agents, limited to reactive behaviors, to ‘adult’ agents capable of high-level abstract reasoning. The proposed categorization divides VLN development into four successive stages: Sensorimotor (basic perception-action coupling), Preoperational (symbolic processing and memory), Concrete Operational (structured planning), and Formal Operational (abstract reasoning). This developmental framework organizes the transition from basic reactive behavior to complex cognitive processing in VLN agents and highlights the role of emerging foundation models in accelerating that transition. We also discuss key challenges in grounding language models in embodied environments and offer suggestions for bridging perception and high-level cognitive functions in VLN research. This review aims to contribute a unified cognitive perspective that can inform future work on VLN and foundation models.
Gao et al. (Wed,) studied this question.