Humanoid robots are an ideal platform for embodied physical intelligence, but mastering mobile manipulation remains a critical challenge. While Vision-Language Models (VLMs) excel at high-level reasoning, integrating them with complex robot dynamics is difficult. This paper presents a hierarchical control framework to bridge this gap. Our approach uses a VLM to decompose ambiguous commands into a sequence of executable sub-tasks. These sub-tasks are then executed by a whole-body controller trained with reinforcement learning in simulation. The controller learns a walking policy that is robust to upper-body disturbances, enabling stable execution of manipulation actions. We validate our framework on challenging door-opening and socket-plugging tasks, providing an effective pathway for connecting a VLM’s digital intelligence with a robot’s physical intelligence.
Ruixuan Jiao
Bo Zhou
Fang Fang
IET conference proceedings.
Southeast University
Jiao et al. (Fri,) studied this question.
www.synapsesocial.com/papers/69fed19ab9154b0b82878f6c — DOI: https://doi.org/10.1049/icp.2026.1882