This paper proposes an adaptive target-music generation model. It employs a controllable variational autoencoder (C-VAE) to construct decoupled structure and control latent variables, incorporates Transformer-XL to model long-term dependencies, and combines these with a semantically guided modified variational autoencoder (S-GMVAE) that embeds mode-emotion relationships into the latent space for controllable generation. On the MAESTRO and LMD datasets, the model achieves an F1 score of 93.76% and a style-matching score of 91.84%. It maintains a coherence of 90.16% even with 30% of notes missing, while exhibiting the lowest generation latency. In subjective evaluations, melody fluency, emotional authenticity, and semantic consistency all score above 4.6 out of 5. Compared with PRNN, POP909-BART, MTR-VAE, and other baselines, the model performs better in both accuracy and real-time performance. The experimental results demonstrate that the proposed framework offers significant advantages in emotion-controlled style transfer and in robust generation under missing information, providing effective support for intelligent composition, emotional soundtrack creation, and human-computer interaction music systems.
Zhaoqing Ning (Thu,) studied this question.