Text-to-music generation aims to automatically produce audio that is semantically consistent with a natural language description and coherent in musical structure. However, existing methods still struggle with style diversity, rhythmic consistency, and long-term structural modeling. To address these issues, we propose MusicDiffusionNet (MDN), a text-to-music generation model that integrates diffusion models with the WaveNet architecture to jointly model musical semantics and temporal structure in a continuous latent space. By decoupling high-level semantic conditioning from low-level audio generation, MDN improves both long-range structural modeling and the semantic alignment between text and generated music, while keeping generation stable. Building on this framework, we design two complementary mixing strategies to improve generation quality and structural coherence. Adaptive Style Mixing (ASM) performs weighted interpolation among stylistically similar music samples in the style embedding space, subject to key and harmonic compatibility constraints, so that the style distribution is expanded without introducing dissonance. Multi-scale Temporal Mixing (MTM) performs beat-aware temporal decomposition, mixing, and reorganization across multiple time scales, which strengthens the modeling of both local and global temporal variations while preserving rhythmic periodicity and musical groove. Both strategies are integrated into the diffusion process as conditional augmentation mechanisms and contribute to more stable learning and greater representational capacity under limited data. Experiments on the Audiostock dataset show that MDN and its mixing strategies yield consistent improvements across multiple objective metrics covering generation quality, style diversity, and rhythmic coherence, validating the effectiveness of the proposed approach for text-to-music generation.
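To make the two mixing strategies more concrete, the following is a minimal illustrative sketch in Python with NumPy. It is not the authors' implementation: the function names (`key_distance`, `adaptive_style_mix`, `beat_aligned_mix`), the circle-of-fifths compatibility test, the softmax weighting over cosine similarity, the `temperature` and `max_key_dist` parameters, and the constant-tempo assumption (a fixed number of frames per beat) are all assumptions introduced here for illustration. The first function interpolates style embeddings only among key-compatible candidates, in the spirit of ASM; the second swaps whole beat-aligned segments between two sequences, in the spirit of MTM, and can be called with several segment lengths to cover multiple time scales.

```python
import numpy as np


def key_distance(key_a: int, key_b: int) -> int:
    """Circle-of-fifths distance between two pitch classes (0 = C, ..., 11 = B)."""
    # Multiplying by 7 mod 12 maps a pitch class to its circle-of-fifths position.
    pos_a, pos_b = (key_a * 7) % 12, (key_b * 7) % 12
    diff = abs(pos_a - pos_b)
    return min(diff, 12 - diff)


def adaptive_style_mix(anchor_emb, anchor_key, cand_embs, cand_keys,
                       max_key_dist=1, temperature=0.1):
    """Hypothetical ASM-style mix: convex combination of the anchor embedding
    and its key-compatible candidates, weighted by similarity to the anchor."""
    anchor_emb = np.asarray(anchor_emb, dtype=float)
    cand_embs = np.asarray(cand_embs, dtype=float)
    # Assumed compatibility rule: keep candidates whose key is close to the anchor's.
    keep = np.array([key_distance(anchor_key, k) <= max_key_dist
                     for k in cand_keys], dtype=bool)
    if not keep.any():
        return anchor_emb.copy()  # nothing compatible: leave the style unchanged
    pool = np.vstack([anchor_emb, cand_embs[keep]])
    # Cosine similarity of every pooled embedding to the anchor (row 0).
    unit = pool / (np.linalg.norm(pool, axis=1, keepdims=True) + 1e-12)
    sims = unit @ unit[0]
    # Softmax over similarities gives non-negative weights that sum to 1.
    logits = sims / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ pool


def beat_aligned_mix(x_a, x_b, beat_len, beats_per_segment, rng):
    """Hypothetical MTM-style mix at one time scale: swap whole beat-aligned
    segments of x_a with the corresponding segments of x_b.
    Assumes constant tempo, i.e. every beat spans `beat_len` frames."""
    x_a, x_b = np.asarray(x_a), np.asarray(x_b)
    seg = beat_len * beats_per_segment
    n_seg = min(len(x_a), len(x_b)) // seg
    mixed = x_a[: n_seg * seg].copy()
    take_b = rng.random(n_seg) < 0.5  # which segments come from x_b
    for i in np.flatnonzero(take_b):
        mixed[i * seg:(i + 1) * seg] = x_b[i * seg:(i + 1) * seg]
    return mixed, take_b


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    style = adaptive_style_mix([1.0, 0.0], 0, [[0.0, 1.0], [1.0, 1.0]], [6, 7])
    print("mixed style embedding:", style)
    # Mix at two time scales: 1-beat and 4-beat segments.
    a, b = np.zeros(32), np.ones(32)
    for beats in (1, 4):
        mixed, mask = beat_aligned_mix(a, b, beat_len=2, beats_per_segment=beats, rng=rng)
        print(f"{beats}-beat segments taken from b:", mask.astype(int))
```

Because segments are cut only at multiples of the beat length, every swap boundary falls on a beat, which is one simple way to keep the rhythmic grid intact while mixing. How such mixed embeddings and sequences would enter the diffusion process as conditioning is not specified by this sketch.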
Xu et al. (Fri,) studied this question.