Abstract. This study addresses human motion synthesis in the absence of motion capture data and introduces a new paradigm for motion generation based on cross-modal nested alignment. The method includes a multi-scale semantic alignment module that models natural language prompts and skeletal motion sequences in a nested manner at both the local and global levels. Temporal-spatial structural priors are also incorporated to improve motion continuity and semantic accuracy. On the HumanML3D and T2M-Gen datasets, the proposed method improves the motion coverage metric by 12.1%, reduces motion smoothness error by 17.3%, and decreases the average inter-frame drift error by 13.5%. Compared with current mainstream models, it is more robust when handling complex semantic prompts and generating long motion sequences. The study thus offers a new approach to motion generation driven by cross-modal alignment.
Carter et al. (Thu,) studied this question.