What question did this study set out to answer?

To develop a model that enhances music emotion recognition by leveraging multi-resolution spectrograms and a hybrid attention network.

March 30, 2026 | Open Access

Multi-resolution spectrogram based multi-branch hybrid attention network for music emotion recognition

Key Points

Developed the MSMHA model that incorporates multi-scale Mel-spectrograms as inputs.
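Passed each input through a hybrid attention network consisting of a low-level feature extraction module, a window attention-based local feature extraction module, a channel attention-based long skip connection module, a high-level feature extraction module, and a branch classifier.
Applied a decision-level weighted fusion strategy to the multi-branch outputs to produce the final classification results.
Achieved classification accuracies of 90.9% for binary arousal, 86.36% for binary valence, and 79.87% for four-quadrant classification on the PMEmo dataset.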
Ablation studies corroborated the effectiveness of multi-resolution inputs and model modules.
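
To illustrate why spectrograms at different resolutions can be complementary, the sketch below works through the time-frequency trade-off for three window settings. The sample rate, clip length, and the `n_fft`/`hop` pairs are assumptions chosen for illustration; the summary above does not state the settings used in the study. The arithmetic describes a plain unpadded STFT, which is the step that precedes Mel filtering.

```python
from dataclasses import dataclass


@dataclass
class Resolution:
    """One hypothetical spectrogram setting (not taken from the paper)."""
    n_fft: int  # analysis window length in samples
    hop: int    # step between consecutive windows in samples


def describe(res: Resolution, sample_rate: int, n_samples: int) -> dict:
    """Return the time/frequency grid produced by an unpadded STFT."""
    n_frames = 1 + (n_samples - res.n_fft) // res.hop
    return {
        "frames": n_frames,                          # time steps
        "stft_bins": res.n_fft // 2 + 1,             # frequency bins before Mel filtering
        "hz_per_bin": sample_rate / res.n_fft,       # smaller value = finer frequency detail
        "ms_per_hop": 1000.0 * res.hop / sample_rate,  # smaller value = finer time detail
    }


# Assumed values for illustration only.
SAMPLE_RATE = 22050
N_SAMPLES = SAMPLE_RATE * 10  # a 10-second clip
RESOLUTIONS = [Resolution(512, 128), Resolution(1024, 256), Resolution(2048, 512)]

SUMMARY = [describe(r, SAMPLE_RATE, N_SAMPLES) for r in RESOLUTIONS]

for res, info in zip(RESOLUTIONS, SUMMARY):
    print(res, info)
```

Short windows give many frames with coarse frequency bins, while long windows give fewer frames with finer frequency bins, so a branch per resolution sees a different view of the same clip.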

Abstract

Music emotion recognition (MER) is a critical task in the field of music information retrieval. However, most MER research relies solely on single-scale music spectrograms and fails to consider the complementary effects of spectrograms at different scales. Meanwhile, fully extracting emotion-related information from spectrograms remains a major challenge in MER. In this paper, we propose a hybrid attention model based on multi-resolution spectrograms, named MSMHA. The MSMHA model takes multi-scale Mel-spectrograms as inputs, and each input is fed into a well-designed hybrid attention network. The designed attention network successively includes a low-level feature extraction module, a local feature extraction module based on window attention, a channel attention-based long skip connection module, a high-level feature extraction module, and a branch classifier. After being processed by the hybrid attention network, each branch can fully extract emotion-related semantic features from a spectrogram of the specific resolution and output an emotion-classification probability. Finally, a decision-level weighted fusion strategy is applied to the multi-branch outputs to generate the final classification results. The experimental results on the PMEmo dataset demonstrate that our model is both promising and effective, achieving classification accuracies of 90.9%, 86.36%, and 79.87% on the binary-arousal, binary-valence, and four-quadrant dimensions, respectively. Ablation studies further confirm the effectiveness of both the multi-resolution spectrogram inputs and each module of the hybrid attention network.
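
The decision-level weighted fusion step can be sketched as a weighted average of the per-branch class probabilities. This is a minimal illustration, not the authors' implementation: the abstract does not say how the branch weights are chosen, so the weights and probabilities below are hypothetical, and the function names `fuse` and `predict` are introduced here only for the example.

```python
from typing import List, Sequence


def fuse(branch_probs: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Weighted average of per-branch class probabilities.

    Weights are normalized to sum to 1, so the result is still a
    probability distribution when every branch output is one.
    """
    if len(branch_probs) != len(weights):
        raise ValueError("need exactly one weight per branch")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_classes = len(branch_probs[0])
    fused = [0.0] * n_classes
    for probs, w in zip(branch_probs, weights):
        if len(probs) != n_classes:
            raise ValueError("all branches must output the same number of classes")
        for i, p in enumerate(probs):
            fused[i] += (w / total) * p
    return fused


def predict(fused: Sequence[float]) -> int:
    """Index of the class with the highest fused probability."""
    return max(range(len(fused)), key=lambda i: fused[i])


# Hypothetical outputs of three resolution branches for a binary task
# (for example low vs. high arousal), with hypothetical branch weights.
BRANCH_PROBS = [[0.60, 0.40], [0.30, 0.70], [0.55, 0.45]]
WEIGHTS = [0.5, 0.2, 0.3]

FUSED = fuse(BRANCH_PROBS, WEIGHTS)
LABEL = predict(FUSED)
print(FUSED, LABEL)
```

Fusing at the decision level keeps each branch independent up to its classifier, so a branch can be added, removed, or reweighted without retraining the others.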
