What question did this study set out to answer?

The aim is to improve text-to-music generation by enhancing style diversity, rhythmic consistency, and long-term structural modeling.

February 22, 2026 | Open Access

MusicDiffusionNet: Enhancing Text-to-Music Generation with Adaptive Style and Multi-Scale Temporal Mixup Strategies

Key Points

The authors developed MusicDiffusionNet (MDN), which combines diffusion models with the WaveNet architecture to model musical semantics and temporal structure in a continuous latent space.
They introduced Adaptive Style Mixing (ASM) to expand the style distribution while avoiding dissonance, and Multi-scale Temporal Mixing (MTM) to preserve rhythmic coherence.
The model was evaluated on the Audiostock dataset.
MDN and its mixing strategies showed consistent improvements across objective metrics for generation quality, style diversity, and rhythmic coherence.
Both mixing strategies act as conditional augmentation within the diffusion process, contributing to improved learning stability and representational capacity under limited data conditions.

Abstract

Text-to-music generation aims to automatically produce audio content with semantic consistency and coherent musical structure based on natural language descriptions. However, existing methods still face challenges in terms of style diversity, rhythmic consistency, and long-term structural modeling. To address these issues, we propose a novel text-to-music generation model, termed MusicDiffusionNet (MDN), which integrates diffusion models with the WaveNet architecture to jointly model musical semantics and temporal structure in a continuous latent space. By decoupling high-level semantic conditioning from low-level audio generation, MDN enhances its ability to model long-range musical structure while improving semantic alignment between text and generated music with stable generation behavior. Building upon this framework, we further design two complementary mixing strategies to improve generation quality and structural coherence. Adaptive Style Mixing (ASM) performs weighted interpolation among stylistically similar music samples in the style embedding space, incorporating key and harmonic compatibility constraints to expand the style distribution while avoiding dissonance. Multi-scale Temporal Mixing (MTM) adopts beat-aware temporal decomposition, mixing, and reorganization across multiple time scales, thereby enhancing the modeling of both local and global temporal variations while preserving rhythmic periodicity and musical groove. Both strategies are integrated into the diffusion process as conditional augmentation mechanisms, contributing to improved learning stability and representational capacity under limited data conditions. Experimental results on the Audiostock dataset demonstrate that MDN and its mixing strategies achieve consistent improvements across multiple objective metrics, including generation quality, style diversity, and rhythmic coherence, validating the effectiveness of the proposed approach for text-to-music generation.
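
The two mixing strategies described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names (`keys_compatible`, `adaptive_style_mix`, `multiscale_temporal_mix`), the parameters (`alpha`, `scales`, `swap_prob`), and the fifth-based key compatibility rule are all assumptions made for illustration. The sketch also assumes that both tracks are already split into beat-level segments at the same tempo, and it works on plain lists rather than inside a diffusion model.

```python
import random


def keys_compatible(key_a, key_b):
    """Hypothetical rule: keys (pitch classes 0-11) are compatible when they
    are identical or a perfect fourth/fifth apart. Not taken from the paper."""
    return (key_a - key_b) % 12 in (0, 5, 7)


def adaptive_style_mix(anchor_emb, anchor_key, neighbors, alpha=0.5):
    """Weighted interpolation in a style embedding space (ASM-like sketch).

    neighbors: list of (embedding, key, similarity) tuples.
    alpha: weight kept on the anchor embedding (assumed parameter).
    Neighbors in incompatible keys are dropped before mixing.
    """
    allowed = [(emb, sim) for emb, key, sim in neighbors
               if keys_compatible(anchor_key, key)]
    total = sum(sim for _, sim in allowed)
    if not allowed or total <= 0:
        return list(anchor_emb)  # nothing safe to mix with
    mixed = [alpha * value for value in anchor_emb]
    for emb, sim in allowed:
        weight = (1.0 - alpha) * sim / total  # similarity-proportional weight
        for i, value in enumerate(emb):
            mixed[i] += weight * value
    return mixed


def multiscale_temporal_mix(beats_a, beats_b, scales=(4, 1),
                            swap_prob=0.25, rng=None):
    """Beat-aligned segment swapping at several time scales (MTM-like sketch).

    beats_a, beats_b: lists of beat-level segments from two tracks.
    scales: chunk sizes in beats, e.g. 4 (a bar in 4/4) and 1 (a single beat).
    Each chunk is replaced by the same-position chunk of track B with
    probability swap_prob, so beat positions and track length are preserved.
    """
    rng = rng or random.Random()
    n = min(len(beats_a), len(beats_b))
    mixed = list(beats_a[:n])
    for scale in scales:
        for start in range(0, n, scale):
            end = min(start + scale, n)
            if rng.random() < swap_prob:
                mixed[start:end] = beats_b[start:end]
    return mixed
```

Because swapped chunks always come from the same beat positions in the second track, the mixed sequence keeps its beat grid, which is one simple way to respect the rhythmic periodicity the abstract emphasizes.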
