Abstract. Music emotion recognition (MER) is a critical task in music information retrieval. However, most MER research relies on single-scale music spectrograms and does not consider the complementary information carried by spectrograms at different scales. Fully extracting emotion-related information from spectrograms also remains a major challenge. In this paper, we propose MSMHA, a hybrid attention model based on multi-resolution spectrograms. MSMHA takes multi-scale Mel-spectrograms as inputs and feeds each input into a hybrid attention network. This network comprises, in order, a low-level feature extraction module, a local feature extraction module based on window attention, a channel attention-based long skip connection module, a high-level feature extraction module, and a branch classifier. Each branch thus extracts emotion-related semantic features from a spectrogram of one specific resolution and outputs an emotion-classification probability. Finally, a decision-level weighted fusion strategy combines the multi-branch outputs to produce the final classification result. Experiments on the PMEmo dataset show that the model is effective, achieving classification accuracies of 90.9%, 86.36%, and 79.87% on the binary-arousal, binary-valence, and four-quadrant tasks, respectively. Ablation studies further confirm the contribution of both the multi-resolution spectrogram inputs and each module of the hybrid attention network.
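The decision-level weighted fusion step can be illustrated with a minimal sketch. The abstract does not say how the branch weights are chosen or normalized, so the function names, the example weights, and the normalization of the weights to sum to one are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def fuse_branch_probabilities(
    branch_probs: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Weighted average of per-branch class probabilities.

    branch_probs[i][k] is the probability that branch i (one spectrogram
    resolution) assigns to class k. The weights are hypothetical; they are
    normalized here so the fused vector remains a probability distribution.
    """
    if not branch_probs:
        raise ValueError("at least one branch is required")
    if len(branch_probs) != len(weights):
        raise ValueError("need exactly one weight per branch")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    n_classes = len(branch_probs[0])
    fused = [0.0] * n_classes
    for probs, w in zip(branch_probs, weights):
        if len(probs) != n_classes:
            raise ValueError("all branches must predict the same classes")
        for k, p in enumerate(probs):
            fused[k] += (w / total) * p
    return fused


def predict_class(fused: Sequence[float]) -> int:
    """Index of the class with the highest fused probability."""
    return max(range(len(fused)), key=lambda k: fused[k])


# Three branches (three assumed spectrogram resolutions), binary arousal.
branch_probs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
weights = [0.5, 0.25, 0.25]  # illustrative values only
fused = fuse_branch_probabilities(branch_probs, weights)
label = predict_class(fused)
```

In this example the first branch alone would predict class 0, but the fused distribution favors class 1, which is the kind of complementary effect across resolutions that decision-level fusion is meant to exploit.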
Su et al. studied this question.