With the rapid development of artificial intelligence, music generation has evolved from single-modal to cross-modal approaches and is gradually moving toward multi-modal fusion. This survey systematically reviews this developmental trajectory. The discussion begins with the representation methods for key modalities, including audio, symbolic, text, and visual data. Music generation techniques are then organized across single-modal, cross-modal, and multi-modal settings. In addition, key datasets and evaluation methodologies relevant to these tasks are compiled. Finally, the survey discusses core challenges in the field, including modal fusion, data scarcity, and evaluation frameworks, and outlines potential directions for future research.
Li et al. studied this question.