Autonomous mobile robots that coexist with humans must construct not only geometric maps but also semantic maps that can be accessed through natural language. Conventional semantic mapping has mainly focused on assigning labels from predefined closed vocabularies to metric maps, limiting its ability to handle novel objects, open-ended linguistic expressions, and flexible human-robot interaction. Recent advances in large-scale foundation models, particularly large language models (LLMs) and vision-language models (VLMs), have accelerated research on open-vocabulary semantic mapping. In parallel, generative 3D representations such as neural radiance fields and 3D Gaussian splatting have enabled dense, continuous spatial representations associated with language-derived features. Together, these developments allow robots to acquire spatial semantic representations that connect perception, language, and action. This paper reviews this rapidly evolving field through a four-part taxonomy: (i) fusion of semantic features into 3D metric maps; (ii) object-centric open-vocabulary representations; (iii) hierarchical scene graph representations; and (iv) continuous generative 3D language fields. We also revisit the history of open-vocabulary semantic mapping and provide an overview of foundation model-based navigation using language-accessible maps, ranging from object-goal navigation to LLM-based hierarchical task planning. Finally, we introduce evaluation datasets, simulators, robot platforms, and evaluation metrics, and summarize seven open challenges.
Hagiwara et al. studied this question.