Abstract: To enable generalized human motion generation, this paper proposes UniMotion, a unified generation framework that supports multimodal inputs including text, image, and audio. A unified prompt encoder maps the different inputs into a shared cross-modal semantic space, and a two-stage motion decoder progressively generates fine-grained skeleton sequences. A multimodal alignment loss strengthens consistency modeling across different prompts. In semantic generalization evaluation and prompt consistency tests, UniMotion outperforms baseline methods by 7.3% and 8.9%, respectively. In random multimodal prompt switching tests, it maintains 92.4% motion stability and logical consistency, demonstrating good practicality and scalability. This study expands the application scope of multimodal generative models in human motion modeling.
Blake et al. (Thu,) studied this question.