What question did this study set out to answer?

This survey aims to trace the cognitive development of vision-and-language navigation (VLN) agents through established developmental stages.

June 5, 2026 | Open Access

From perception to reasoning: A survey of vision-and-language navigation through cognitive development

Key Points

Utilized Piaget's developmental stages to analyze VLN agents' technical progression.
Categorized VLN development into four stages: Sensorimotor, Preoperational, Concrete Operational, and Formal Operational.
Discussed challenges of grounding language models in embodied environments.
Identified the transition of VLN agents from simple reactive behaviors to complex reasoning.
Highlighted the importance of foundational models in enhancing VLN cognitive development.
Outlined significant obstacles in integrating perception and cognitive functions within VLN.

Abstract

Vision-and-language navigation (VLN) has emerged as a pivotal domain in the advancement of embodied artificial intelligence, leveraging multimodal perception and reasoning. This survey traces the cognitive trajectory of VLN agents, utilizing Piaget’s developmental stages as a framework for analyzing their technical progression. We explore the transition of VLN agents from ‘infant’ agents, limited to reactive behaviors, to ‘adult’ agents exhibiting high-level abstract reasoning. The proposed categorization divides VLN development into four successive stages: Sensorimotor (basic perception-action coupling), Preoperational (symbolic processing and memory), Concrete Operational (structured planning), and Formal Operational (abstract reasoning). This developmental framework provides a comprehensive understanding of the transition from basic reactive behavior to complex cognitive processing within VLN agents, highlighting the influential role of emerging foundational models in accelerating this transition. We also discuss key challenges associated with grounding language models in embodied environments and offer suggestions for bridging perception and high-level cognitive functions in VLN research. This review seeks to contribute a unified cognitive perspective, offering insights for future advancements in VLN and foundational model research.
