Existing dance movement generation methods still exhibit significant deficiencies in controllability, rhythmic consistency, style retention, and long-term temporal dependency modeling. These drawbacks limit their practical deployment in applications such as virtual human driving and digital content generation. To address these gaps, this study proposes an improved Transformer-based dance movement generation method that aims to enhance the naturalness, fluency, and controllability of the generated movements. First, a motion-aware self-attention mechanism is constructed, which strengthens the model's ability to capture local dynamic changes by introducing temporal motion weights. Second, a dual-stream structure consisting of a pose stream and a motion stream is designed to jointly model spatial and temporal features. In addition, a cross-modal music conditioning module is introduced to align the generated movements with the rhythm, energy, and emotional tension of the music. Inverse kinematics and energy constraints further improve the physical plausibility of the movements, while hierarchical temporal modeling and semi-supervised training enhance generation stability. Experimental results show that the proposed method consistently outperforms baseline models across evaluation metrics, including Fréchet Inception Distance, Perceptual Evaluation of Motion Quality, Motion Diversity Score, and Speed and Acceleration Consistency. It also achieves higher accuracy, precision, and recall in dance style classification tasks. These results indicate that the model can effectively capture motion style features and generate continuous and diverse dance sequences. The proposed generation framework achieves a favorable balance among motion naturalness, temporal consistency, and style controllability.
It can be applied to scenarios such as virtual digital human movement generation, dance creation assistance, and interactive immersive systems, offering a practical technical pathway for automated dance content generation.
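
The motion-aware self-attention mechanism is described above only at a high level. One way such a mechanism could be realized is sketched below, purely as an illustration and not as the formulation used in this study. The sketch assumes that the temporal motion weight of a frame is its normalized frame-to-frame displacement and that this weight enters the attention logits as an additive bias scaled by a hypothetical factor `lam`. Learned query, key, and value projections are omitted, and the function names are illustrative.

```python
import math

def motion_weights(poses):
    """Per-frame motion magnitude (distance to the previous frame), scaled to [0, 1].
    Assumption: this is one plausible definition of a 'temporal motion weight'."""
    mags = [0.0]  # the first frame has no predecessor
    for prev, cur in zip(poses, poses[1:]):
        mags.append(math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, cur))))
    peak = max(mags)
    return [m / peak if peak > 0 else 0.0 for m in mags]

def softmax(row):
    top = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def motion_aware_attention(poses, lam=1.0):
    """Scaled dot-product self-attention with an additive motion bias on the keys.
    Assumption: identity projections; `lam` is a hypothetical bias strength."""
    dim = len(poses[0])
    w = motion_weights(poses)
    attn = []
    for q in poses:
        logits = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) + lam * w[j]
            for j, k in enumerate(poses)
        ]
        attn.append(softmax(logits))
    # Each output frame is the attention-weighted average of all input frames.
    out = [
        [sum(row[j] * poses[j][d] for j in range(len(poses))) for d in range(dim)]
        for row in attn
    ]
    return attn, out

# Toy sequence of four 2-D "poses"; the largest jump occurs at frame index 2.
poses = [[0.0, 0.0], [0.1, 0.0], [1.0, 0.5], [1.05, 0.5]]
plain_attn, _ = motion_aware_attention(poses, lam=0.0)
biased_attn, biased_out = motion_aware_attention(poses, lam=2.0)
```

With `lam=0.0` the sketch reduces to ordinary scaled dot-product attention. A positive `lam` shifts attention mass toward frames with large local motion, which is the qualitative behavior the abstract attributes to the temporal motion weights.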
Xie et al. (Wed,) studied this question.