Music emotion recognition (MER) remains challenging due to the complex and heterogeneous nature of emotional expression in music. Existing multimodal approaches typically treat audio as a monolithic modality and fuse it naively with lyrics, neglecting the distinct emotional contributions of vocals and accompaniment. To address this limitation, we propose a tri-modal MER framework that explicitly decouples the audio signal into vocals and accompaniment and jointly models their interactions with lyrics through a cross-attention fusion mechanism termed Tri-Modal Cross-Attention Fusion (TCAF). Specifically, the TCAF module establishes six bidirectional attention pathways among the three modalities, enabling fine-grained cross-modal alignment and context-aware feature enhancement. We further introduce a gated fusion unit that dynamically weights each modality’s contribution on a per-sample basis. We evaluate our method on two benchmarks: the regression dataset PMEmo and our newly curated six-class classification dataset MCEmo, which is grounded in Ekman’s basic emotion theory. Experimental results show that our approach achieves state-of-the-art performance and significantly outperforms strong baselines, with an 8.2% relative improvement in average R² on PMEmo and a 3.6 percentage-point gain in six-class accuracy on MCEmo. Ablation studies confirm the effectiveness of audio decoupling and the proposed fusion strategy, demonstrating that modeling vocals and accompaniment as separate modalities is crucial for capturing nuanced emotional cues in music.
Li et al. (Fri,) studied this question.
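To make the fusion idea in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the six attention pathways correspond to one pathway per ordered pair of the three modalities, that each pathway is ordinary scaled dot-product attention without learned projections, and that the gated fusion unit is a softmax over one scalar score per modality. The names `cross_attend`, `tri_modal_fuse`, and `gate_weights` are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query_seq, context_seq):
    """Scaled dot-product attention: each query vector attends over context_seq.

    Assumption: keys and values are the raw context vectors (no learned projections).
    """
    dim = len(query_seq[0])
    out = []
    for q in query_seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context_seq]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, context_seq)) for i in range(dim)])
    return out


def mean_pool(seq):
    """Average a sequence of vectors into a single vector."""
    n = len(seq)
    return [sum(v[i] for v in seq) / n for i in range(len(seq[0]))]


def tri_modal_fuse(features, gate_weights):
    """Fuse three modalities with pairwise cross-attention and a per-sample gate.

    features: dict mapping modality name -> sequence of equal-width vectors.
    gate_weights: dict mapping modality name -> scoring vector (hypothetical gate parameters).
    """
    names = list(features)
    pathways = []
    pooled = {}
    for target in names:
        attended = []
        for source in names:
            if source != target:
                pathways.append((target, source))  # one pathway per ordered pair
                attended.append(cross_attend(features[target], features[source]))
        # Residual sum of the modality's own features and its two attended views.
        enhanced = [
            [x + a + b for x, a, b in zip(own, view1, view2)]
            for own, view1, view2 in zip(features[target], attended[0], attended[1])
        ]
        pooled[target] = mean_pool(enhanced)
    # Assumed gate: softmax over one scalar score per modality, computed per sample.
    scores = [sum(w * x for w, x in zip(gate_weights[n], pooled[n])) for n in names]
    gates = softmax(scores)
    dim = len(pooled[names[0]])
    fused = [sum(g * pooled[n][i] for g, n in zip(gates, names)) for i in range(dim)]
    return fused, dict(zip(names, gates)), pathways


# Toy inputs: random vectors stand in for encoder outputs of each modality.
random.seed(0)
dim = 4


def random_seq(length):
    return [[random.gauss(0, 1) for _ in range(dim)] for _ in range(length)]


features = {"vocals": random_seq(5), "accompaniment": random_seq(5), "lyrics": random_seq(3)}
gate_weights = {name: [random.gauss(0, 1) for _ in range(dim)] for name in features}
fused, gates, pathways = tri_modal_fuse(features, gate_weights)
print(len(pathways), "pathways; gate weights:", gates)
```

In this sketch the gate weights sum to one for each sample, so a track whose lyrics carry little emotional signal could, in principle, be scored mostly from its vocals and accompaniment. A trained model would replace the random inputs with encoder outputs and learn both the attention projections and the gate parameters.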