What question did this study set out to answer?

The aim is to develop a framework that enhances humanoid control by integrating vision and language for improved task execution.

May 22, 2026 | Open Access

Humanoid-WAM: World-Action Models for Vision-Language Humanoid Control

Key points

Developed Humanoid-WAM, a World-Action Model (WAM) framework for humanoid control under vision-language conditioning.
Integrated runtime safety monitoring, contact prediction, and balance-aware constraints for robust deployment.
Evaluated on simulated humanoid benchmarks using MuJoCo, IsaacGym, and humanoid manipulation environments.
Improved locomotion stability and reduced fall frequency.
Achieved higher task success rates than purely reactive VLA baselines.
Showed stronger instruction-following performance than the same baselines.

Abstract

Recent advances in Vision-Language-Action (VLA) models have demonstrated promising capabilities in instruction-conditioned robotic manipulation and embodied reasoning. However, existing approaches often struggle to generalize to humanoid whole-body control due to limited long-horizon planning, weak latent future prediction, insufficient safety awareness, and poor integration between reinforcement learning and multimodal world understanding. In particular, many current VLA systems primarily focus on reactive policy generation rather than jointly modeling future world dynamics and action evolution for humanoid embodiment. We present Humanoid-WAM, a unified World-Action Model framework for humanoid embodied intelligence that jointly learns latent world dynamics, action generation, reward prediction, contact reasoning, and safety-aware control under multimodal visual-language conditioning. Humanoid-WAM integrates RGB observations, proprioceptive sensing, depth perception, and natural language instructions into a shared latent representation, enabling scalable humanoid locomotion, navigation, and manipulation. The framework combines transformer-based VLA reasoning with latent world modeling and reinforcement learning fine-tuning, allowing the system to perform imagination-based planning and long-horizon policy optimization. To improve deployment robustness, we further introduce a runtime safety monitoring module that incorporates balance-aware constraints, contact prediction, and unsafe action filtering during control execution. We evaluate Humanoid-WAM across multiple simulated humanoid benchmarks using MuJoCo, IsaacGym, and humanoid manipulation environments. Experimental results demonstrate improved locomotion stability, higher task success rates, reduced fall frequency, and stronger instruction-following performance compared with purely reactive VLA baselines. 
Ablation studies further show that jointly modeling world dynamics and action generation significantly improves long-horizon control performance and policy robustness. Our results suggest that integrating latent world modeling with multimodal humanoid policy learning provides a promising direction toward scalable humanoid foundation models and general-purpose embodied intelligence systems.
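
The abstract does not say how imagination-based planning or unsafe action filtering is implemented. As a rough illustration of the general idea only, the Python sketch below rolls candidate action sequences forward through a toy dynamics function, discards any rollout that violates a balance limit, and executes the first action of the best remaining one. Everything here is an assumption made for illustration: the names `toy_dynamics`, `toy_reward`, `is_safe`, and `plan`, the two-number state, and all constants are hypothetical, and simple random-shooting search stands in for whatever planner and learned latent model the authors actually use.

```python
import random


def toy_dynamics(state, action):
    """Hypothetical stand-in for a learned latent world model.

    state is (position, tilt); action is a scalar command in [-1, 1].
    """
    position, tilt = state
    next_position = position + 0.1 * action
    # Tilt decays toward upright but grows with the commanded action.
    next_tilt = 0.8 * tilt + 0.3 * action
    return (next_position, next_tilt)


def toy_reward(state, goal=1.0):
    """Hypothetical reward: negative distance to a goal position."""
    position, _ = state
    return -abs(goal - position)


def is_safe(state, max_tilt=0.5):
    """Hypothetical balance constraint: predicted tilt must stay bounded."""
    _, tilt = state
    return abs(tilt) <= max_tilt


def plan(state, horizon=5, num_candidates=64, rng=None):
    """Random-shooting planner with a safety filter on imagined rollouts.

    Returns the first action of the best safe candidate sequence, or the
    fallback action 0.0 when every imagined rollout is predicted unsafe.
    """
    rng = rng or random.Random(0)
    best_action, best_return = 0.0, float("-inf")
    for _ in range(num_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        imagined, total, safe = state, 0.0, True
        for action in actions:
            imagined = toy_dynamics(imagined, action)
            if not is_safe(imagined):
                safe = False  # reject the whole sequence
                break
            total += toy_reward(imagined)
        if safe and total > best_return:
            best_action, best_return = actions[0], total
    return best_action


if __name__ == "__main__":
    state = (0.0, 0.0)
    for step in range(10):
        action = plan(state, rng=random.Random(step))
        state = toy_dynamics(state, action)
        print(step, round(action, 3), tuple(round(x, 3) for x in state))
```

In this sketch the safety check runs inside the imagined rollout, so an action is executed only if every predicted state along its sequence stays within the tilt limit. A real runtime monitor would also need to check the measured state, since a learned model's predictions can be wrong.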
