What question did this study set out to answer?

This study aims to improve music emotion recognition by addressing the limitations of existing multimodal approaches.

June 15, 2026 · Open Access

Music emotion recognition using tri-modal fusion of lyrics, vocals, and accompaniment with cross-attention

Key Points

Developed a tri-modal framework that decouples audio into vocals and accompaniment.
Introduced a cross-attention fusion mechanism (TCAF) that establishes bidirectional attention pathways among lyrics, vocals, and accompaniment.
Evaluated the method on the PMEmo dataset for regression and on a new six-class classification dataset, MCEmo.
Achieved an 8.2% relative improvement in average R² on PMEmo.
Gained 3.6 percentage points in accuracy on the six-class MCEmo dataset.
Ablation studies support the necessity of decoupling vocals and accompaniment for better emotional cue capture.
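
The two headline numbers use different units: the PMEmo result is a relative improvement in R², while the MCEmo result is an absolute gain in percentage points. The sketch below shows the standard definitions behind each. The function names are illustrative and not taken from the paper, and the code assumes the conventional coefficient of determination, which this summary does not spell out.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes y_true is not constant (otherwise SS_tot is zero).
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def relative_improvement_pct(baseline, new):
    """Relative change of a metric, in percent of the baseline value."""
    return (new - baseline) / baseline * 100.0


def percentage_point_gain(baseline_acc, new_acc):
    """Absolute difference between two accuracies given as fractions in [0, 1]."""
    return (new_acc - baseline_acc) * 100.0
```

A relative improvement depends on the size of the baseline, whereas a percentage-point gain does not, so the two figures are not directly comparable.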

Abstract

Music emotion recognition (MER) remains challenging due to the complex and heterogeneous nature of emotional expression in music. Existing multimodal approaches typically treat audio as a monolithic modality and fuse it naively with lyrics, neglecting the distinct emotional contributions of vocals and accompaniment. To address this limitation, we propose a tri-modal MER framework that explicitly decouples the audio signal into vocals and accompaniment and jointly models their interactions with lyrics through a cross-attention fusion mechanism termed Tri-Modal Cross-Attention Fusion (TCAF). Specifically, the TCAF module establishes six bidirectional attention pathways among the three modalities, enabling fine-grained cross-modal alignment and context-aware feature enhancement. We further introduce a gated fusion unit that dynamically weights each modality’s contribution on a per-sample basis. We evaluate our method on two benchmarks: the regression dataset PMEmo and our newly curated six-class classification dataset MCEmo, grounded in Ekman’s basic emotion theory. Experimental results show that our approach achieves state-of-the-art performance, significantly outperforming strong baselines, with an 8.2% relative improvement in average R² on PMEmo and a 3.6 percentage-point gain in six-class accuracy on MCEmo. Ablation studies confirm the effectiveness of audio decoupling and the proposed fusion strategy, demonstrating that modeling vocals and accompaniment as separate modalities is crucial for capturing nuanced emotional cues in music.
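
The fusion idea described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation. It assumes plain scaled dot-product attention without learned projections, mean pooling over time, a residual sum of attended features, and softmax-normalised gates supplied by the caller. All function and variable names (`cross_attend`, `tri_modal_fuse`, `gate_logits`) are hypothetical. In a trained model the projections and the gate logits would be learned per sample.

```python
import math
from itertools import permutations


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query_seq, context_seq):
    """Scaled dot-product attention: each query vector attends over context_seq.

    Assumption: no learned W_q, W_k, W_v projections; context vectors
    serve as both keys and values.
    """
    d = len(query_seq[0])
    out = []
    for q in query_seq:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in context_seq]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, context_seq))
                    for j in range(d)])
    return out


def mean_pool(seq):
    """Average a sequence of d-dimensional vectors into one vector."""
    n = len(seq)
    return [sum(v[j] for v in seq) / n for j in range(len(seq[0]))]


def tri_modal_fuse(features, gate_logits):
    """Fuse three modalities via pairwise cross-attention and gating.

    features:    dict of modality name -> list of d-dimensional vectors
    gate_logits: dict of modality name -> float (hypothetical per-sample
                 gate scores; a real model would predict these)
    """
    names = list(features)
    # Every ordered (target, source) pair of distinct modalities:
    # 3 modalities give 3 * 2 = 6 directed attention pathways.
    pathways = list(permutations(names, 2))

    enhanced = {}
    for name in names:
        pooled = mean_pool(features[name])  # the modality's own summary
        for target, source in pathways:
            if target == name:
                attended = mean_pool(
                    cross_attend(features[target], features[source]))
                pooled = [p + a for p, a in zip(pooled, attended)]
        enhanced[name] = pooled

    # Gated fusion: softmax gates weight each modality's enhanced vector.
    gates = softmax([gate_logits[n] for n in names])
    d = len(enhanced[names[0]])
    fused = [sum(g * enhanced[n][j] for g, n in zip(gates, names))
             for j in range(d)]
    return fused, dict(zip(names, gates)), pathways
```

Calling `tri_modal_fuse` with a dictionary keyed by `"lyrics"`, `"vocals"`, and `"accompaniment"` returns the fused vector, the per-modality gate weights, and the list of six directed pathways. Sequences may differ in length across modalities as long as they share the same feature dimension.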

Li et al. studied this question.

synapsesocial.com/papers/6a2f96eca1cfeec490828094 https://doi.org/10.1007/s11042-026-21719-3
