ABSTRACT Dance generation is a significant research area in computer arts and artificial intelligence. This study proposes a novel framework that enhances the controllability and personalization of generated dance through multimodal, multi‐granularity control. The framework establishes global choreographic control of long sequences via music and dance style factors, while accommodating local style variations. At the same time, it enables fine‐grained local control using style, text, and temporal factors for motion refinement. We develop two cross‐modal Transformers: the LS‐M2D model merges music and dance style features for local style‐controllable dance generation, and the LT‐SM2D model integrates textual guidance with music and dance style features for time‐constrained local control. Experimental results demonstrate enhanced motion quality, effective multi‐granularity style control, and precise, flexible text‐guided control. The framework thus provides technical support for personalized intelligent dance generation systems.
Wang et al. (Thu,) studied this question.