Generating 3D human motions from text is a challenging yet valuable task. The key aspects of this task are ensuring text-motion consistency and achieving generation diversity. Although recent advances have enabled the generation of precise, high-quality human motions from text, achieving diversity in the generated motions remains a significant challenge. In this paper, we aim to overcome this challenge by designing a simple yet effective text-to-motion generation method, i.e., Diverse-T2M. Our method introduces uncertainty into the generation process, enabling the generation of highly diverse motions while preserving semantic consistency with the text. Specifically, we propose a novel perspective that uses noise signals as carriers of diversity information in transformer-based methods, facilitating an explicit modeling of uncertainty. Moreover, we construct a latent space where text is projected into a continuous representation, instead of a rigid one-to-one mapping, and integrate a latent space sampler to introduce stochastic sampling into the generation process, thereby enhancing the diversity and uncertainty of the outputs. Our results on text-to-motion generation benchmark datasets (HumanML3D and KIT-ML) demonstrate that our method significantly enhances diversity while maintaining state-of-the-art performance in text consistency.
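To make the idea of a latent space sampler concrete, the following is a minimal sketch and not the Diverse-T2M implementation. It assumes, purely for illustration, that the text embedding is projected to the mean and log-variance of a diagonal Gaussian and that a latent code is drawn by adding scaled noise; the abstract does not specify the distribution or the architecture. The names `project_text`, `sample_latent`, `w_mu`, and `w_logvar` are hypothetical.

```python
import math
import random


def project_text(embedding, w_mu, w_logvar):
    """Map a text embedding to the mean and log-variance of a diagonal
    Gaussian. The Gaussian form and linear projections are assumptions."""
    mu = [sum(w * x for w, x in zip(row, embedding)) for row in w_mu]
    logvar = [sum(w * x for w, x in zip(row, embedding)) for row in w_logvar]
    return mu, logvar


def sample_latent(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1). The noise term eps is
    what makes repeated draws for the same text differ."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


# Toy values (hypothetical): a 3-dimensional text embedding, 2-dimensional latent.
rng = random.Random(0)
text_embedding = [0.2, -0.1, 0.5]
w_mu = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
w_logvar = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # log-variance 0, i.e. unit variance

mu, logvar = project_text(text_embedding, w_mu, w_logvar)
z_a = sample_latent(mu, logvar, rng)  # one latent code for the text
z_b = sample_latent(mu, logvar, rng)  # a second, different code for the same text
```

Under these assumptions, the same text yields a fixed mean but a different latent code on every draw, so a downstream motion decoder conditioned on the code would see varied inputs for one prompt.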
Zheng Qin
Shandong University
Yabing Wang
University of Science and Technology of China
Minghui Yang
China Mobile (China)
Qin et al. (Thu,) studied this question.
synapsesocial.com/papers/68d6e0fc8b2b6861e4c3f502 — DOI: https://doi.org/10.48550/arxiv.2508.20604