What does this research mean for the field?

Applying low-rank adaptation (LoRA) to the text encoder of a music generation model, combined with cosine-similarity-based adapter selection, improves prompt-audio semantic alignment for genre-specific music generation while keeping training cost low. The contribution is methodological, and it neither confirms nor contradicts the existing consensus in the field.

May 16, 2026 | Open Access

Few-shot LoRA tuning for genre-specific music generation with semantic prompt matching

Key Points

Aimed to develop an efficient method for genre-specific music generation with minimal resources.
Applied low-rank adaptation (LoRA) to fine-tune MusicGen's text encoder.
Used cosine similarity between the user's prompt and predefined genre labels in the text embedding space to select the most suitable genre-specific adapter.
Conducted experiments on the FMA dataset with the jazz and hip-hop genres.
Increased the CLAP-based text-audio cosine similarity from 0.3524 to 0.3813 for jazz prompts.
Increased the CLAP-based text-audio cosine similarity from 0.3154 to 0.3326 for hip-hop prompts.
Showed consistent performance gains over the baseline MusicGen-small model without LoRA.
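
To put the reported scores in perspective, the short sketch below computes the absolute and relative gains from the four figures listed above. The figures come from the study; the helper name `gains` and the choice to express the gain relative to the baseline are illustrative assumptions, not part of the paper.

```python
# CLAP-based text-audio cosine similarity reported in the study:
# (baseline MusicGen-small, LoRA-adapted model) per genre.
reported = {
    "jazz": (0.3524, 0.3813),
    "hip-hop": (0.3154, 0.3326),
}


def gains(baseline, adapted):
    """Return (absolute gain, gain as a percentage of the baseline).

    Hypothetical helper for illustration; the paper reports only the
    raw before/after scores.
    """
    absolute = adapted - baseline
    relative_pct = 100.0 * absolute / baseline
    return absolute, relative_pct


for genre, (baseline, adapted) in reported.items():
    absolute, relative_pct = gains(baseline, adapted)
    print(f"{genre}: +{absolute:.4f} absolute, +{relative_pct:.1f}% relative")
```

Under this reading, the jazz gain is about 0.029 (roughly 8% over the baseline) and the hip-hop gain is about 0.017 (roughly 5%).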

Abstract

This paper presents an efficient method for genre-specific music generation by applying low-rank adaptation (LoRA) to the text encoder of MusicGen, a large-scale text-to-music generation model. Full fine-tuning of such models is computationally expensive and resource-intensive, making it impractical for lightweight applications or small-scale research groups. To address this, we fine-tune only a small number of parameters using LoRA, significantly reducing training cost while preserving the base model’s capabilities. Furthermore, we propose a mechanism for automatically selecting the most suitable genre-specific LoRA adapter based on cosine similarity between the user’s prompt and predefined genre labels in the text embedding space. This enables effective music generation even when the user does not explicitly mention a genre. Experiments conducted on the FMA dataset using jazz and hip-hop genres demonstrate that the proposed method improves alignment between prompts and generated audio, measured using contrastive language–audio pretraining (CLAP)-based text-audio similarity, which quantifies semantic alignment via cosine similarity in a joint text–audio embedding space. The results show consistent performance gains over the baseline MusicGen-small model without LoRA, validating the effectiveness of LoRA in genre adaptation and the proposed adapter selection strategy. On average, applying our method increased the CLAP-based text-to-audio cosine similarity score (higher indicates stronger prompt-audio semantic alignment) from 0.3524 to 0.3813 for jazz prompts and from 0.3154 to 0.3326 for hip-hop prompts. These improvements demonstrate that genre-adapted LoRA tuning yields more semantically aligned and stylistically appropriate music. Our approach enables flexible and efficient customization of music generation models with minimal resources across diverse genres and applications.
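
The adapter selection step described in the abstract can be sketched as follows: embed the prompt and each genre label, score them by cosine similarity, and pick the adapter for the highest-scoring genre. This is a minimal sketch under stated assumptions, not the authors' implementation. The three-dimensional vectors are made-up stand-ins for real text-encoder embeddings, and the function names `cosine_similarity` and `select_adapter` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_adapter(prompt_embedding, genre_embeddings):
    """Return the genre whose label embedding is closest to the prompt.

    genre_embeddings maps a genre name to its label embedding. The
    per-genre scores are returned as well so they can be inspected.
    """
    scores = {
        genre: cosine_similarity(prompt_embedding, embedding)
        for genre, embedding in genre_embeddings.items()
    }
    best_genre = max(scores, key=scores.get)
    return best_genre, scores


# Toy embeddings (assumption): a real system would obtain these from
# the text encoder rather than hard-coding them.
genre_embeddings = {
    "jazz": [0.9, 0.1, 0.2],
    "hip-hop": [0.1, 0.9, 0.3],
}
# Stand-in for a prompt that never names a genre explicitly.
prompt_embedding = [0.8, 0.2, 0.1]

best_genre, scores = select_adapter(prompt_embedding, genre_embeddings)
print(best_genre, scores)
```

Because selection depends only on the relative ordering of the scores, the prompt does not need to mention a genre by name; it only needs to land closer to one genre label than to the others in the embedding space.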
