This paper addresses music style transfer and audio synthesis and proposes a decoupled conditional VAE (DC-VAE), a framework built on an improved Variational Autoencoder (VAE). A dual-branch encoder models music content and style separately, so that the core content and the stylistic characteristics of a piece can be controlled independently. For style transfer, DC-VAE efficiently transfers the style of one audio clip to another while leaving the core content of the original music unchanged. For audio synthesis, sampling from the latent space and conditioning on a target style produces new, diverse audio that meets the specified style requirements. Experiments show that DC-VAE outperforms baseline models, including a standard VAE, a conditional VAE (CVAE), and CycleGAN, in style accuracy, content similarity, and audio quality. In addition, introducing an adversarial loss, a style consistency loss, and a content-style mutual information minimization loss further improves the naturalness and fidelity of the generated audio, which demonstrates the effectiveness of the framework for music style transfer and audio synthesis.
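The abstract does not specify layer types, feature representation, or loss formulas, so the following is only a toy sketch of the general idea under stated assumptions, not the authors' implementation. It uses untrained random dense layers on plain Python lists. The names `ToyDCVAE`, `transfer`, `synthesize`, the loss weights, and the definition of the style consistency term (distance between the style code of the converted output and that of the style reference) are all hypothetical. The adversarial loss and the mutual information estimate are passed in as precomputed scalars, because the abstract does not say how the discriminator or the MI estimator is built.

```python
import math
import random


def init_layer(n_out, n_in, rng):
    """Random dense layer (weights, bias); stands in for a trained network."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(layer, x):
    """Compute y = W x + b on plain Python lists."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def kl_to_standard_normal(mu, logvar):
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ).
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


class ToyDCVAE:
    """Hypothetical toy model: variational content branch, deterministic style branch."""

    def __init__(self, feat_dim, content_dim, style_dim, seed=0):
        self.rng = random.Random(seed)
        self.content_dim = content_dim
        self.enc_mu = init_layer(content_dim, feat_dim, self.rng)      # content branch, mean
        self.enc_logvar = init_layer(content_dim, feat_dim, self.rng)  # content branch, log-variance
        self.enc_style = init_layer(style_dim, feat_dim, self.rng)     # style branch
        self.dec = init_layer(feat_dim, content_dim + style_dim, self.rng)  # conditional decoder

    def encode_content(self, x):
        return dense(self.enc_mu, x), dense(self.enc_logvar, x)

    def encode_style(self, x):
        return [math.tanh(v) for v in dense(self.enc_style, x)]

    def sample_content(self, mu, logvar):
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
        return [m + math.exp(0.5 * lv) * self.rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]

    def decode(self, z_content, style):
        # The decoder sees the concatenation [z_content ; style] as its condition.
        return dense(self.dec, z_content + style)

    def transfer(self, x_source, x_style_ref):
        # Keep the source's content code (posterior mean), swap in the reference's style code.
        mu, _ = self.encode_content(x_source)
        return self.decode(mu, self.encode_style(x_style_ref))

    def synthesize(self, style):
        # New output: content drawn from the N(0, I) prior, style given as the condition.
        z = [self.rng.gauss(0.0, 1.0) for _ in range(self.content_dim)]
        return self.decode(z, style)

    def loss(self, x, x_style_ref, adv_loss, mi_estimate, weights=(1.0, 1.0, 0.1, 1.0, 0.1)):
        # Assumed weighted sum; weights and term definitions are illustrative only.
        w_rec, w_kl, w_adv, w_sty, w_mi = weights
        mu, logvar = self.encode_content(x)
        recon = self.decode(self.sample_content(mu, logvar), self.encode_style(x))
        converted = self.transfer(x, x_style_ref)
        style_consistency = mse(self.encode_style(converted), self.encode_style(x_style_ref))
        return (w_rec * mse(recon, x) + w_kl * kl_to_standard_normal(mu, logvar)
                + w_adv * adv_loss + w_sty * style_consistency + w_mi * mi_estimate)


# Usage with stand-in 8-dimensional feature vectors (e.g. one spectrogram frame each).
model = ToyDCVAE(feat_dim=8, content_dim=4, style_dim=2)
data_rng = random.Random(1)
x_a = [data_rng.gauss(0.0, 1.0) for _ in range(8)]
x_b = [data_rng.gauss(0.0, 1.0) for _ in range(8)]
converted = model.transfer(x_a, x_b)   # content of x_a, style of x_b
new_sample = model.synthesize(model.encode_style(x_b))
total = model.loss(x_a, x_b, adv_loss=0.5, mi_estimate=0.1)
```

Using the posterior mean in `transfer` keeps the conversion deterministic, while `synthesize` draws from the prior to obtain diverse outputs for the same style condition; a real system would train all branches jointly and operate on richer audio representations.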
Chao Guo (Sun,) studied this question.