This paper presents an efficient method for genre-specific music generation that applies low-rank adaptation (LoRA) to the text encoder of MusicGen, a large-scale text-to-music generation model. Full fine-tuning of such models is computationally expensive, making it impractical for lightweight applications and small research groups. To address this, we train only the small set of parameters introduced by LoRA, which significantly reduces training cost while preserving the base model’s capabilities. We further propose a mechanism that automatically selects the most suitable genre-specific LoRA adapter by computing the cosine similarity between the user’s prompt and predefined genre labels in the text embedding space, so that genre-appropriate music can be generated even when the prompt does not explicitly mention a genre. We evaluate the method on the jazz and hip-hop genres of the FMA dataset using contrastive language–audio pretraining (CLAP)-based text-audio similarity, which quantifies semantic alignment between a prompt and the generated audio as the cosine similarity in a joint text–audio embedding space (higher indicates stronger alignment). Compared with the baseline MusicGen-small model without LoRA, the proposed method raises the average CLAP score from 0.3524 to 0.3813 for jazz prompts and from 0.3154 to 0.3326 for hip-hop prompts. These consistent gains support the effectiveness of both LoRA-based genre adaptation and the proposed adapter selection strategy, and suggest that genre-adapted LoRA tuning yields more semantically aligned and stylistically appropriate music. Our approach enables flexible and efficient customization of music generation models with minimal resources across diverse genres and applications.
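The adapter selection step described above can be illustrated with a minimal sketch. It assumes that the prompt and each genre label have already been mapped to vectors by the text encoder; the three-dimensional vectors and the function names `cosine_similarity` and `select_adapter` are hypothetical and chosen only for illustration, not taken from the paper's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def select_adapter(prompt_embedding, genre_embeddings):
    """Return the genre whose label embedding is closest to the prompt.

    genre_embeddings maps a genre name (one LoRA adapter per genre)
    to the embedding of its label.
    """
    scores = {
        genre: cosine_similarity(prompt_embedding, embedding)
        for genre, embedding in genre_embeddings.items()
    }
    best_genre = max(scores, key=scores.get)
    return best_genre, scores


# Toy embeddings (assumed values, not real text-encoder outputs).
genre_embeddings = {
    "jazz": [0.9, 0.1, 0.2],
    "hip-hop": [0.1, 0.8, 0.3],
}
prompt_embedding = [0.7, 0.2, 0.1]

best_genre, scores = select_adapter(prompt_embedding, genre_embeddings)
print(best_genre, scores)
```

In a full system, the selected genre name would determine which LoRA adapter is loaded into the text encoder before generation; how a low maximum similarity should be handled (for example, falling back to the base model) is a design choice that the abstract does not specify.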
Lee et al. (Thu,) studied this question.