Advances in deep learning have made multi-style music synthesis an important frontier in artificial intelligence (AI) research. However, existing methods still struggle to model multi-level musical structure, control style precisely, and train stably. To address these problems, we propose a novel model that combines a variational autoencoder with a generative adversarial network (VAE-GAN) for high-quality, controllable multi-style music generation. The model has three components: a multi-scale temporal feature fusion transformer that captures local and global musical structure; a variational autoencoder-based style decoupling module that separates and processes content and style representations; and an adaptive adversarial training scheme with multiple discriminators that improves generation quality and eases training. Experiments on the MAESTRO and Lakh MIDI (Musical Instrument Digital Interface) datasets show that the proposed model outperforms state-of-the-art baselines. On MAESTRO, it reduces mean squared error (MSE) by 22.0% and Fréchet audio distance (FAD) by 26.9% relative to the best existing method. On Lakh MIDI, it reduces MSE by 21.7% and FAD by 26.9%. These results indicate that the model generates expressive, structurally coherent music across varied styles, making it a strong solution for multi-style music generation.
Shen et al. studied this question.