This paper proposes a unified vision–language framework that translates user instructions into navigation for the mobile base and actions for the manipulator in indoor environments. Occupancy grid maps constructed via SLAM generally capture only the geometric layout of the environment, which leaves the robot unable to use the semantic information needed to distinguish objects. The proposed method encodes semantic information from vision–language models, together with the robot’s pose, in a textual format referred to as a semantic topological graph. Specifically, models including GroundingDINO, LG EXAONE, and SAM2 extract object-level semantic information, which is subsequently used to identify room characteristics. A large language model then interprets user instructions to identify the final navigation destination within the semantic topological graph and reasons about which action network is suitable. Notably, the proposed text-based representation substantially reduces inference time, and its effectiveness is validated through real-world experiments.
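As a rough illustration of what a text-encoded semantic topological graph might look like, the following Python sketch stores a room label, a pose, and detected object labels per node and serializes them to plain text. The abstract does not specify the schema, so every name here (`Node`, `to_text`, `pick_destination`, the field names, and the line format) is a hypothetical choice. The keyword-overlap function is only a stand-in for the large language model step, not the method used in the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One place in the graph. All field names are hypothetical."""
    node_id: str
    room: str                          # room label inferred from detected objects
    pose: Tuple[float, float, float]   # (x, y, yaw) in the map frame
    objects: List[str] = field(default_factory=list)
    neighbors: List[str] = field(default_factory=list)


def to_text(nodes: List[Node]) -> str:
    """Serialize the graph to plain text, one line per node."""
    lines = []
    for n in nodes:
        x, y, yaw = n.pose
        lines.append(
            f"{n.node_id}: room={n.room}; pose=({x:.1f}, {y:.1f}, {yaw:.2f}); "
            f"objects={', '.join(n.objects)}; neighbors={', '.join(n.neighbors)}"
        )
    return "\n".join(lines)


def pick_destination(instruction: str, nodes: List[Node]) -> Optional[Node]:
    """Stand-in for the LLM step (assumption): pick the node whose room or
    object labels overlap most with the words of the instruction."""
    words = set(instruction.lower().replace(".", " ").split())
    best, best_score = None, 0
    for n in nodes:
        labels = {n.room.lower(), *(o.lower() for o in n.objects)}
        score = len(words & labels)
        if score > best_score:
            best, best_score = n, score
    return best


# Toy two-node graph with made-up poses and object labels.
graph = [
    Node("n0", "kitchen", (1.0, 2.0, 0.0), ["sink", "refrigerator"], ["n1"]),
    Node("n1", "office", (4.5, 2.0, 1.57), ["desk", "monitor"], ["n0"]),
]
prompt_context = to_text(graph)   # text that could be placed in an LLM prompt
target = pick_destination("Bring the cup to the desk.", graph)
```

In a real system the serialized text would be passed to the language model together with the user instruction, and the selected node's pose would become the navigation goal.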
Park et al. (Mon,) studied this question.