What question did this study set out to answer?

To develop a model for automatic music emotion annotation and classification that enhances emotional discriminability.

March 7, 2026 | Open Access

A siamese vision transformer-based model for automatic music emotion annotation and classification

Key Points

Aimed to develop a model for automatic music emotion annotation and classification that enhances emotional discriminability.
Developed a Siamese Vision Transformer architecture for classification.
Focused on global feature extraction from spectrograms.
Utilized metric learning tailored for music emotion signals.
Evaluated the model using real-world music datasets.
Achieved an accuracy of 0.810 in emotion classification.
Reported a precision of 0.851 and F1-score of 0.793.
Outperformed existing baseline models, demonstrating improved robustness across diverse music styles.
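For readers who want to see how figures such as the accuracy, precision, and F1-score above are typically derived, here is a minimal Python sketch. It is not the authors' evaluation code. The summary does not say how precision and F1 were averaged across emotion classes, so macro averaging is an assumption here, and the function name `classification_metrics` and the labels `"happy"` and `"sad"` are hypothetical.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return (accuracy, macro precision, macro F1) for two label sequences.

    Macro averaging is an assumption; the source does not state which
    averaging scheme was used.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()

    # Count true positives, false positives, and false negatives per class.
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1

    precisions, f1_scores = [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        f1_scores.append(f1)

    accuracy = sum(tp.values()) / len(y_true)
    macro_precision = sum(precisions) / len(labels)
    macro_f1 = sum(f1_scores) / len(labels)
    return accuracy, macro_precision, macro_f1


# Hypothetical toy labels, not data from the study.
acc, prec, f1 = classification_metrics(
    ["happy", "happy", "sad", "sad"],
    ["happy", "sad", "sad", "sad"],
)
```

Precision can exceed accuracy, as it does in the reported results, because it is computed per class over predicted positives rather than over all samples.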

Abstract

Automatic Music Emotion Annotation and Classification plays a crucial role in applications such as music recommendation, retrieval, and emotional regulation. However, existing models often struggle to focus on emotionally salient regions in spectrograms and perform inconsistently on stylistically diverse music. We therefore propose a novel classification model based on the Siamese Vision Transformer (SViT) architecture. Our method enhances global feature extraction and improves emotional discriminability by leveraging a Siamese structure and metric learning tailored for music emotion signals. Experimental evaluations on real-world music datasets show that our SViT model achieves strong performance, with an accuracy of 0.810, precision of 0.851, and F1-score of 0.793, outperforming existing baselines and demonstrating improved robustness across diverse music styles.
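To make the Siamese structure and metric learning idea concrete, the sketch below pairs a shared encoder with a standard contrastive loss. It is an illustration under stated assumptions, not the paper's method: the abstract does not name the loss function, so the contrastive loss is a swapped-in, commonly used choice, and the one-layer linear `embed` function is a toy stand-in for the Vision Transformer encoder that would process spectrograms. All names (`embed`, `contrastive_loss`, `margin`) are hypothetical.

```python
import math


def embed(x, weights):
    """Toy shared encoder: a single linear layer.

    Stand-in for a Vision Transformer; both branches of the Siamese
    pair call this with the SAME weights, which is the weight sharing
    that defines a Siamese network.
    """
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def contrastive_loss(x1, x2, same_emotion, weights, margin=1.0):
    """Contrastive loss for one pair (assumed loss, not from the source).

    Pairs with the same emotion label are pulled together (loss = d^2);
    pairs with different labels are pushed apart until they are at
    least `margin` away (loss = max(0, margin - d)^2).
    """
    d = euclidean(embed(x1, weights), embed(x2, weights))
    if same_emotion:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Hypothetical 2-D inputs and identity weights, for illustration only.
identity = [[1.0, 0.0], [0.0, 1.0]]
loss_same = contrastive_loss([0.0, 0.0], [3.0, 4.0], True, identity)
loss_diff = contrastive_loss([0.0, 0.0], [0.6, 0.0], False, identity)
```

With a loss of this kind, clips that share an emotion end up close together in embedding space and clips with different emotions end up separated, which is one common way to improve discriminability between classes.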
