This study introduces a novel text-conditioned dance motion dataset, SWDance, along with a transfer-learned diffusion model, MDMSWD, for generating dance sequences conditioned on spoken word text. To address the scarcity of dance datasets, particularly text-to-dance datasets, we propose a YouTube-sourced pipeline for collecting text-to-motion data quickly and easily. To our knowledge, this study is also the first to generate dance motions from non-descriptive text. Although user preference was neutral, MDMSWD exhibited no significant disadvantage compared to ground truth. Participants expressed a strong interest in using an improved version of the model in their dance practice. The results suggest exciting possibilities at the intersection of AI, dance, and spoken word.
Hertog et al. studied this question.