Automatic music emotion annotation and classification play a crucial role in applications such as music recommendation, retrieval, and emotional regulation. However, existing models often struggle to focus on emotionally salient regions in spectrograms and perform inconsistently on stylistically diverse music. To address these limitations, we propose a novel classification model based on the Siamese Vision Transformer (SViT) architecture. Our method enhances global feature extraction and improves emotional discriminability by leveraging a Siamese structure and metric learning tailored to music emotion signals. Experimental evaluations on real-world music datasets show that our SViT model achieves strong performance, with an accuracy of 0.810, a precision of 0.851, and an F1-score of 0.793, outperforming existing baselines and demonstrating improved robustness across diverse music styles.
Li et al. studied this question.
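To make the metric-learning idea concrete, the following is a minimal sketch of a pairwise contrastive loss of the kind often used with Siamese encoders. It is an illustration under stated assumptions, not the loss used in this work: the abstract does not specify the objective, and the function names, the Euclidean distance, and the `margin` value are all hypothetical. The embeddings stand in for the outputs of a shared encoder applied to two spectrograms.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, same_emotion, margin=1.0):
    """Contrastive loss for one pair of embeddings from a shared encoder.

    same_emotion: True if both clips carry the same emotion label.
    margin: hypothetical hyperparameter, not taken from the paper.
    """
    d = euclidean_distance(emb_a, emb_b)
    if same_emotion:
        # Matching pairs are penalized by their squared distance,
        # which pulls them together in the embedding space.
        return d ** 2
    # Mismatched pairs are penalized only while they are closer
    # than the margin, which pushes them apart up to that distance.
    return max(0.0, margin - d) ** 2


# Usage: a same-emotion pair that is far apart incurs a large loss,
# while a different-emotion pair beyond the margin incurs none.
loss_same = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_emotion=True)
loss_diff = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_emotion=False)
```

Under this assumed objective, clips with the same emotion label are drawn toward each other regardless of musical style, which is one plausible route to the cross-style robustness described above.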