September 5, 2025 | Open Access

A Unified Framework for Human Motion Generation with Multimodal Inputs

Key Points

UniMotion outperforms baseline methods by 7.3% in semantic generalization evaluation and by 8.9% in prompt consistency tests.
In random multimodal prompt switching tests, UniMotion maintains 92.4% motion stability and logical consistency.
The method employs a unified prompt encoder to create a shared semantic space for various inputs.
A multimodal alignment loss function strengthens consistency modeling across different prompts.

Abstract

To enable generalized human motion generation, this paper proposes a unified generation framework, UniMotion, which supports multimodal inputs including text, image, and audio. The method uses a unified prompt encoder to map different inputs into a shared cross-modal semantic space. It adopts a two-stage motion decoder to gradually generate fine-grained skeleton sequences. A multimodal alignment loss function is introduced to strengthen consistency modeling across different prompts. In semantic generalization evaluation and prompt consistency tests, UniMotion outperforms baseline methods by 7.3% and 8.9%, respectively. In random multimodal prompt switching tests, it maintains 92.4% motion stability and logical consistency, demonstrating good practicality and scalability. This study expands the application scope of multimodal generative models in human motion modeling.
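The abstract does not give the form of the multimodal alignment loss. As an illustration only, the sketch below assumes a symmetric contrastive (InfoNCE-style) objective, a common choice for pulling paired embeddings from two modalities together in a shared space. The function names, the `temperature` value, and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity_matrix(emb_a, emb_b):
    """Pairwise cosine similarity between two lists of embeddings."""
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def alignment_loss(emb_a, emb_b, temperature=0.07):
    """Assumed symmetric contrastive loss between two modalities.

    emb_a[i] and emb_b[i] are treated as embeddings of the same prompt
    in different modalities (for example text and audio).
    """
    sim = cosine_similarity_matrix(emb_a, emb_b)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # b-to-a direction
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(logits_t))


# Toy 2-D embeddings: matched pairs should give a lower loss than swapped ones.
text_emb = [[1.0, 0.0], [0.0, 1.0]]
audio_matched = [[0.9, 0.1], [0.1, 0.9]]
audio_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_matched = alignment_loss(text_emb, audio_matched)
loss_swapped = alignment_loss(text_emb, audio_swapped)
```

Under this assumption, minimizing the loss pushes embeddings of the same prompt in different modalities toward each other and away from embeddings of other prompts, which is one way a shared cross-modal semantic space could be encouraged.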
