Humanoid robots are an ideal platform for embodied physical intelligence, but mobile manipulation remains a critical challenge. While Vision-Language Models (VLMs) excel at high-level reasoning, integrating them with complex robot dynamics is difficult. This paper presents a hierarchical control framework to bridge this gap. Our approach uses a VLM to decompose ambiguous commands into a sequence of executable sub-tasks. These sub-tasks are then executed by a whole-body controller trained with reinforcement learning in simulation. The controller learns a walking policy that is robust to upper-body disturbances, enabling stable execution of manipulation actions. We validate our framework on challenging door-opening and socket-plugging tasks, providing an effective pathway for connecting a VLM’s digital intelligence with a robot’s physical intelligence.
Jiao et al. (Fri,) studied this question.