This paper proposes a novel deep learning framework for music emotion recognition based on multimodal data fusion. To address limitations of existing approaches, such as weak cross-genre generalization, insufficient modeling of long-range temporal dependencies, and inadequate capture of hierarchical emotional structures, the study introduces two key components: the Harmonic Semantic Encoder (HSE) and the Contrastive Harmonic Alignment (CHA) strategy. The HSE adopts a dual-pathway architecture that combines convolutional neural networks for fine-grained acoustic feature extraction with a Transformer-based global encoder for modeling long-range temporal and harmonic dependencies. In addition, a harmonic-aware attention mechanism emphasizes emotionally salient frequency bands, enabling the model to better capture melody lines, chord progressions, and other musically meaningful structures. To further enhance representation quality, the CHA strategy incorporates three hierarchical contrastive objectives: local invariant contrast, structural semantic contrast, and harmonic context alignment. These objectives encourage temporally consistent, semantically discriminative, and harmonically aligned embeddings. The framework also supports multimodal fusion of audio, lyrics, and metadata through a modality-aware attention mechanism, using masking and placeholder embeddings to handle missing modalities robustly. Experiments on the PMEmo and GlobalMood datasets show that the proposed method consistently outperforms strong baselines, including SVM, CNN, CRNN, ResNet-based models, and lightweight architectures. The framework achieves higher accuracy and Macro-F1 scores while maintaining a favorable balance between performance and computational complexity. Ablation studies further confirm the independent contributions of the global Transformer encoder, the harmonic-aware attention module, and the CHA learning objective.
The proposed framework provides a robust and scalable solution for multimodal music emotion recognition, advancing hierarchical modeling and harmonic-aware representation learning in affective computing.
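The handling of missing modalities through masking and placeholder embeddings can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `fuse_modalities`, the per-modality scalar scores (which a trained model would presumably produce from learned parameters), and the zero placeholder vector are all assumptions made for illustration. The sketch only shows the general idea that a masked softmax assigns zero attention weight to an absent modality, so the placeholder never contributes to the fused embedding.

```python
import math


def fuse_modalities(embeddings, scores, placeholder):
    """Attention-weighted fusion over modalities, ignoring missing ones.

    embeddings:  dict name -> list[float], or None if the modality is missing
    scores:      dict name -> float, unnormalized attention score per modality
                 (hypothetical input; a real model would learn these)
    placeholder: list[float] substituted for any missing modality
    """
    names = list(embeddings)
    present = [embeddings[n] is not None for n in names]
    if not any(present):
        raise ValueError("at least one modality must be present")

    # Substitute the placeholder so every modality has a same-sized vector.
    vectors = [embeddings[n] if ok else placeholder
               for n, ok in zip(names, present)]

    # Masked softmax: a missing modality gets score -inf, hence weight 0.
    masked = [scores[n] if ok else -math.inf
              for n, ok in zip(names, present)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the modality vectors, one dimension at a time.
    dim = len(placeholder)
    fused = [sum(w * v[d] for w, v in zip(weights, vectors))
             for d in range(dim)]
    return fused, dict(zip(names, weights))


# Example: lyrics are missing, so their (large) score is masked out.
fused, weights = fuse_modalities(
    embeddings={"audio": [1.0, 0.0], "lyrics": None, "metadata": [0.0, 1.0]},
    scores={"audio": 0.0, "lyrics": 5.0, "metadata": 0.0},
    placeholder=[0.0, 0.0],
)
```

In this example the remaining attention mass is split evenly between audio and metadata, because their scores are equal and the lyrics entry is masked regardless of its score.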
Runhua Li (Thu,) studied this question.