What question did this study set out to answer?

This research aims to improve humanoid robot manipulation by integrating vision-language models and a control framework.

May 9, 2026

Vision-language guided planning and control for humanoid whole-body manipulation

Key Points

Developed a hierarchical control framework that pairs a vision-language model (VLM) for planning with a whole-body controller for execution.
Used the VLM to decompose ambiguous commands into a sequence of executable sub-tasks.
Trained the whole-body controller with reinforcement learning in simulation.
Learned a resilient walking policy that is robust to upper-body disturbances, enabling stable execution of manipulation actions.
Validated the framework on door opening and socket plugging tasks.
Offers a pathway for connecting a VLM's digital intelligence with a robot's physical intelligence.

Abstract

Humanoid robots are an ideal platform for embodied physical intelligence, but mastering mobile manipulation remains a critical challenge. While Vision-Language Models (VLMs) excel at high-level reasoning, integrating them with complex robot dynamics is difficult. This paper presents a hierarchical control framework to bridge this gap. Our approach uses a VLM to decompose ambiguous commands into a sequence of executable sub-tasks. These plans are then executed by a whole-body controller trained with reinforcement learning in simulation. The controller learns a resilient walking policy that is robust to upper-body disturbances, enabling stable execution of manipulation actions. We validate our framework on challenging door opening and socket plugging tasks, providing an effective pathway to connect a VLM’s digital intelligence with a robot’s physical intelligence.
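The two-level structure described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the names `SubTask`, `stub_planner`, `stub_controller`, and `run` are hypothetical, the planner is a lookup table standing in for a real VLM, the controller is a stub standing in for the RL-trained whole-body controller, and the door-opening decomposition is invented rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SubTask:
    """One step of a plan. Skill and target names are hypothetical."""
    skill: str
    target: str


def stub_planner(command: str) -> List[SubTask]:
    """Stand-in for the VLM: a lookup table, not a real model call."""
    plans: Dict[str, List[SubTask]] = {
        # Assumed decomposition, for illustration only.
        "open the door": [
            SubTask("walk_to", "door"),
            SubTask("grasp", "handle"),
            SubTask("pull", "handle"),
        ],
    }
    return plans.get(command.lower().strip(), [])


def stub_controller(task: SubTask) -> bool:
    """Stand-in for the whole-body controller; this stub always succeeds."""
    return True


def run(
    command: str,
    planner: Callable[[str], List[SubTask]] = stub_planner,
    controller: Callable[[SubTask], bool] = stub_controller,
) -> List[str]:
    """Plan once, then execute sub-tasks in order, stopping at the first failure."""
    log: List[str] = []
    for task in planner(command):
        ok = controller(task)
        log.append(f"{task.skill}({task.target}): {'ok' if ok else 'failed'}")
        if not ok:
            break
    return log
```

Calling `run("Open the door")` walks through the three assumed sub-tasks in order. Passing a different `controller` callable shows how a low-level failure halts the plan, which is the point at which a real system might ask the planner to replan.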
