What question did this study set out to answer?

This research aims to develop a model that effectively synthesizes various musical styles using deep learning techniques.

May 29, 2026 · Open Access

Multi-style music generation model design based on variational autoencoders and generative adversarial networks

Key Points

Introduced a VAE-GAN architecture incorporating a multi-scale temporal feature fusion transformer.
Developed a style decoupling module for separating content and style representations.
Implemented adaptive adversarial training with multiple discriminators for improved quality and training stability.
Achieved a 22.0% reduction in mean squared error (MSE) and a 26.9% drop in Fréchet audio distance (FAD) on the MAESTRO dataset.
Reduced MSE by 21.7% and FAD by 26.9% on the Lakh MIDI dataset, indicating enhanced generation quality.
Demonstrated that the model can generate expressive, structurally sound music across varied styles.
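
The percentage figures above are relative reductions against a baseline, which is the usual reading of "a 22.0% reduction in MSE". As a minimal sketch of that arithmetic, assuming lower is better for both MSE and FAD: the function name `relative_reduction` and the numeric values below are hypothetical and are not taken from the study.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Return the percentage by which `proposed` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - proposed) / baseline


# Hypothetical error values for illustration only (not results from the paper).
baseline_error = 0.200
proposed_error = 0.150
print(f"{relative_reduction(baseline_error, proposed_error):.1f}% reduction")
```

A positive return value means the proposed model has the lower error; a negative value would mean it is worse than the baseline.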

Abstract

Advances in deep learning have made the synthesis of different musical styles an important frontier in artificial intelligence (AI) research. However, existing methods still face significant challenges in modeling multi-level musical structure, controlling style precisely, and maintaining stable training. To address these problems, we propose a novel model that combines a variational autoencoder and a generative adversarial network (VAE-GAN) for high-quality, controllable multi-style music generation. The model has three components: a multi-scale temporal feature fusion transformer that captures local and global musical structure, a variational autoencoder-based style decoupling module that separates content and style representations, and an adaptive adversarial training scheme with multiple discriminators that improves generation quality and stabilizes training. Experiments on the MAESTRO and Lakh MIDI (Musical Instrument Digital Interface) datasets show that the proposed model outperforms state-of-the-art baselines. On MAESTRO, it achieves a 22.0% reduction in mean squared error (MSE) and a 26.9% reduction in Fréchet audio distance (FAD) relative to the strongest existing method. On Lakh MIDI, it reduces MSE by 21.7% and FAD by 26.9%. These results show that the model can generate expressive, structurally sound music in varied styles, making it a strong solution for multi-style music generation.
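
To make the combination of objectives concrete, the following is a minimal sketch of a generic VAE-GAN generator loss: a reconstruction term (MSE), a KL term that regularizes the latent code, and an adversarial term averaged over several discriminators. It is an illustration under stated assumptions, not the authors' formulation. The abstract does not give the loss, so the function names, the weights `beta` and `lam`, and the plain averaging over discriminators are all hypothetical; the transformer encoder, the style decoupling module, and the adaptive weighting are not modeled here.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1) (reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_divergence(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


def mse(target, output):
    """Mean squared error between two equal-length sequences."""
    return sum((t - o) ** 2 for t, o in zip(target, output)) / len(target)


def generator_adv_loss(disc_probs):
    """Non-saturating generator loss, averaged over discriminators.

    disc_probs holds each discriminator's probability (in (0, 1]) that the
    generated sample is real. Equal weighting is an assumption.
    """
    return -sum(math.log(p) for p in disc_probs) / len(disc_probs)


def vae_gan_generator_loss(target, output, mu, log_var, disc_probs,
                           beta=1.0, lam=0.1):
    """Weighted sum of the three terms; beta and lam are hypothetical weights."""
    return (mse(target, output)
            + beta * kl_divergence(mu, log_var)
            + lam * generator_adv_loss(disc_probs))


# Toy usage with made-up numbers (not data from the study).
rng = random.Random(0)
mu, log_var = [0.1, -0.2], [0.0, -1.0]
z = reparameterize(mu, log_var, rng)  # latent code a decoder would consume
loss = vae_gan_generator_loss(
    target=[0.5, 0.2, 0.9], output=[0.4, 0.3, 0.8],
    mu=mu, log_var=log_var, disc_probs=[0.6, 0.4, 0.7],
)
print(len(z), loss > 0)
```

In a style-decoupled variant, one would typically keep separate `mu` and `log_var` vectors for content and for style and add a KL term for each, though how the paper structures this is not stated in the abstract.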
