What question did this study set out to answer?

To develop a framework that effectively translates user instructions into navigation directions and manipulator actions for mobile robots in indoor settings.

March 12, 2026 | Open Access

EXAONE-VLA: A Unified Vision–Language Framework for Mobile Manipulation via Semantic Topology and Hierarchical LLM Reasoning

Key Points

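Occupancy grid maps built via SLAM capture only the geometric layout, leaving the robot without the semantic information needed to distinguish objects.
Encoded semantic information from vision–language models, together with the robot's pose, in a textual semantic topological graph.
Used GroundingDINO, LG EXAONE, and SAM2 to extract object-level semantics, which are then used to identify room characteristics.
Used a large language model to interpret user instructions, identify the navigation destination in the semantic topological graph, and select a suitable action network.
The text-based representation substantially reduced inference time.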
Real-world experiments confirmed the effectiveness and efficiency of the framework.

Abstract

This paper proposes a unified vision–language framework that translates user instructions into navigation for the mobile base and actions for the manipulator in indoor environments. Occupancy grid maps constructed via SLAM generally capture only the geometric layout of the environment, which leaves the robot unable to use the semantic information required to distinguish objects. The proposed method encodes semantic information from vision–language models, together with the robot’s pose, in a textual format referred to as a semantic topological graph. Specifically, models including GroundingDINO, LG EXAONE, and SAM2 extract object-level semantic information, which is subsequently used to identify room characteristics. A large language model then interprets user instructions to identify the final navigation destination within the semantic topological graph, and then reasons about which action network is suitable. Notably, the proposed text-based representation substantially reduces inference time, and its effectiveness is validated through real-world experiments.
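
As a rough illustration of what a textual semantic topological graph might look like, the sketch below serializes a few nodes (room label, robot pose, detected objects, and connectivity) into plain text and wraps that text in a prompt for a language model. The schema, the field names, the `Node`, `graph_to_text`, and `build_prompt` helpers, and the prompt wording are all assumptions made for illustration; the abstract does not specify the paper's actual format.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One place in the topological graph (hypothetical schema, not the paper's)."""
    name: str                          # room label, e.g. inferred from detected objects
    pose: tuple[float, float, float]   # (x, y, yaw) of the robot in the map frame
    objects: list[str] = field(default_factory=list)    # object-level semantics
    neighbors: list[str] = field(default_factory=list)  # names of adjacent nodes


def graph_to_text(nodes: list[Node]) -> str:
    """Serialize the graph as one line of plain text per node."""
    lines = []
    for n in nodes:
        x, y, yaw = n.pose
        lines.append(
            f"node {n.name} | pose x={x:.2f} y={y:.2f} yaw={yaw:.2f} | "
            f"objects: {', '.join(n.objects) or 'none'} | "
            f"connected to: {', '.join(n.neighbors) or 'none'}"
        )
    return "\n".join(lines)


def build_prompt(graph_text: str, instruction: str) -> str:
    """Combine the textual graph and the user instruction into a single prompt.

    The wording is an assumption; any LLM client could consume the result.
    """
    return (
        "You are given a semantic topological graph of an indoor environment.\n"
        f"{graph_text}\n"
        f"User instruction: {instruction}\n"
        "Answer with the name of the destination node."
    )


if __name__ == "__main__":
    graph = [
        Node("kitchen", (1.0, 2.0, 0.0), ["sink", "cup"], ["hallway"]),
        Node("hallway", (0.0, 0.0, 1.57), [], ["kitchen"]),
    ]
    print(build_prompt(graph_to_text(graph), "Bring me a cup"))
```

Because the map is passed to the model as a few short lines of text rather than as images, the prompt stays compact, which is consistent with the abstract's claim that the text-based representation reduces inference time.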
