What question did this study set out to answer?

The aim is to create a robust framework for music emotion recognition by leveraging multimodal data sources.

June 13, 2026 | Open Access

A deep learning framework for emotion recognition in music using multimodal data fusion

Key Points

The aim is to create a robust framework for music emotion recognition by leveraging multimodal data sources.
Proposed a deep learning framework incorporating a Harmonic Semantic Encoder (HSE) and a Contrastive Harmonic Alignment (CHA) strategy.
Paired convolutional neural networks with a Transformer-based global encoder to model long-range temporal and harmonic dependencies.
Conducted experiments on PMEmo and GlobalMood datasets to validate performance against established baselines.
Achieved superior accuracy and Macro-F1 scores compared to SVM, CNN, CRNN, and ResNet-based models.
Demonstrated the framework's capability to handle multimodal inputs effectively, even with missing data.
Confirmed contributions of the global Transformer encoder and attention module through ablation studies.
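
For readers unfamiliar with the Macro-F1 score cited in the key points, the following is a minimal sketch of how the metric is conventionally computed: F1 is calculated per class and then averaged without weighting, so rare emotion classes count as much as common ones. The function name and the example labels are illustrative assumptions; the summary does not show the authors' evaluation code or label set.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never matches
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical emotion labels, for illustration only
y_true = ["happy", "happy", "sad", "calm"]
y_pred = ["happy", "sad", "sad", "calm"]
print(macro_f1(y_true, y_pred))
```

Unlike plain accuracy, this average is not dominated by the most frequent class, which is presumably why it is reported alongside accuracy.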

Abstract

This paper proposes a novel deep learning framework for music emotion recognition based on multimodal data fusion. To address limitations of existing approaches, such as weak cross-genre generalization, insufficient modeling of long-range temporal dependencies, and inadequate capture of hierarchical emotional structures, the study introduces two key components: the Harmonic Semantic Encoder (HSE) and the Contrastive Harmonic Alignment (CHA) strategy. The HSE adopts a dual-pathway architecture that integrates convolutional neural networks for fine-grained acoustic feature extraction with a Transformer-based global encoder for modeling long-range temporal and harmonic dependencies. In addition, a harmonic-aware attention mechanism is designed to emphasize emotionally salient frequency bands, enabling the model to better capture melody lines, chord progressions, and other musically meaningful structures. To further enhance representation quality, the CHA strategy incorporates hierarchical contrastive objectives: local invariant contrast, structural semantic contrast, and harmonic context alignment. These objectives encourage temporally consistent, semantically discriminative, and harmonically aligned embeddings. The framework also supports multimodal fusion of audio, lyrics, and metadata through a modality-aware attention mechanism, with masking and placeholder embeddings to handle missing modalities robustly. Extensive experiments on the PMEmo and GlobalMood datasets demonstrate that the proposed method consistently outperforms strong baselines such as SVM, CNN, CRNN, ResNet-based models, and lightweight architectures. The framework achieves superior accuracy and Macro-F1 scores while maintaining a favorable balance between performance and computational complexity. Ablation studies further confirm the independent contributions of the global Transformer encoder, the harmonic-aware attention module, and the CHA learning objective.
The proposed framework provides a robust and scalable solution for multimodal music emotion recognition, advancing hierarchical modeling and harmonic-aware representation learning in affective computing.
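
The abstract describes attention-based fusion of audio, lyrics, and metadata with masking and placeholder embeddings for missing modalities, but gives no implementation details. The following is a minimal sketch of one common way to realize that idea, under stated assumptions: each modality has a fixed-size embedding and a scalar relevance score (learned in a real model, hard-coded here), a placeholder vector stands in for any missing modality so the input shape stays fixed, and a masked softmax gives missing modalities zero weight. The function `fuse_modalities` and all values are hypothetical and are not taken from the paper.

```python
import math


def fuse_modalities(embeddings, scores, placeholder):
    """Attention-weighted sum of modality embeddings with masking.

    embeddings:  dict of modality name -> list of floats, or None if missing
    scores:      dict of modality name -> relevance logit (assumed given here)
    placeholder: vector substituted for a missing modality (keeps shapes fixed)
    Returns (fused_vector, weights_by_modality).
    """
    names = list(embeddings)
    present = [embeddings[m] is not None for m in names]
    if not any(present):
        raise ValueError("at least one modality must be present")

    # Substitute the placeholder so every modality slot holds a vector
    vectors = [embeddings[m] if ok else placeholder for m, ok in zip(names, present)]

    # Masked softmax: only present modalities contribute to the normalizer
    max_logit = max(scores[m] for m, ok in zip(names, present) if ok)
    exps = [math.exp(scores[m] - max_logit) if ok else 0.0
            for m, ok in zip(names, present)]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(placeholder)
    fused = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
    return fused, dict(zip(names, weights))


# Illustrative input: lyrics are missing for this track
fused, weights = fuse_modalities(
    embeddings={"audio": [1.0, 0.0], "lyrics": None, "metadata": [0.0, 1.0]},
    scores={"audio": 0.0, "lyrics": 0.0, "metadata": 0.0},
    placeholder=[0.0, 0.0],
)
print(fused, weights)
```

In this sketch the mask, not the placeholder, is what removes a missing modality's influence; the placeholder only keeps the data layout uniform. Whether the paper's placeholder embeddings are learned or carry information of their own is not stated in the abstract.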
