Visual Simultaneous Localization and Mapping (vSLAM) is fundamental to robotic mobile manipulation, i.e., the integration of navigation, perception, and dexterous interaction with objects in unstructured environments. Yet current vSLAM research largely lacks a principled, task-oriented framework for map classification, which leads to suboptimal map representations that hinder robustness and efficiency in dynamic indoor settings. To bridge this gap, we propose a purpose-driven taxonomy of vSLAM maps designed specifically for mobile manipulation tasks. The taxonomy comprises four complementary categories: geometric 3D maps, semantic maps, object-level maps, and hybrid maps. Each is distinguished by its representational granularity, functional scope, and suitability for downstream manipulation primitives. We systematically compare their construction pipelines, underlying technical assumptions, and real-world deployment contexts, and evaluate them along three dimensions: environmental adaptability, pose estimation accuracy, and real-time computational feasibility. Finally, we synthesize key limitations of existing approaches and identify concrete directions for future work, including tight coupling between mapping semantics and manipulation affordances, and scalable learning-based map fusion.