What question did this study set out to answer?

The aim is to enhance music structure analysis and segmentation by utilizing self-supervised learning for better audio representation.

June 17, 2026 · Open Access

Intelligent Music Segmentation and Structure Analysis Using Self-Supervised Audio Representation Learning

Key points

Developed a Transformer-based audio encoder trained via masked audio modeling on unlabeled music data.
Implemented a dual-branch network: one for segment boundary detection using dilated convolutional networks, the other for structural clustering with contrastive learning.
Introduced a structural entropy-based optimization module for refining hierarchical structure trees.
Achieved a 6.2% to 9.5% improvement in F1-score for boundary detection compared to state-of-the-art methods.
Improved normalized mutual information (NMI) for structural clustering by 5.8% to 8.3%.
Visualization via t-SNE showed that learned representations effectively capture meaningful music structures.
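
As a rough illustration of how a boundary-detection F1-score like the one reported above is typically computed, the sketch below matches estimated boundaries to reference boundaries within a tolerance window. The function name `boundary_f1`, the 0.5-second tolerance, and the example timestamps are assumptions for illustration; the summary does not state which tolerance or matching procedure the study used.

```python
def boundary_f1(reference, estimated, tolerance=0.5):
    """Precision, recall and F1 for segment boundaries (times in seconds).

    An estimated boundary counts as a hit if it lies within `tolerance`
    seconds of a reference boundary; each boundary is matched at most once.
    The 0.5 s default is an assumed, commonly used window, not a value
    taken from the study.
    """
    ref = sorted(reference)
    est = sorted(estimated)
    i = j = hits = 0
    while i < len(ref) and j < len(est):
        diff = est[j] - ref[i]
        if abs(diff) <= tolerance:
            hits += 1          # one-to-one match, consume both boundaries
            i += 1
            j += 1
        elif diff < 0:
            j += 1             # estimate is too early to match any remaining reference
        else:
            i += 1             # reference is too early to match any remaining estimate
    precision = hits / len(est) if est else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical annotations: four reference boundaries, three estimates.
reference = [0.0, 10.0, 20.0, 30.0]
estimated = [0.2, 10.4, 25.0]
print(boundary_f1(reference, estimated))
```

In this made-up example two of the three estimates fall inside the window, so precision is 2/3, recall is 1/2, and F1 is 4/7. A wider tolerance (3 seconds is another common choice) would be more forgiving of timing errors.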

Abstract

Music structure analysis (MSA) and segmentation are fundamental tasks in music information retrieval (MIR), aiming to decompose music into semantically coherent segments (e.g., verse, chorus, bridge) and reveal hierarchical structural relationships. Traditional methods rely on handcrafted audio features (e.g., MFCC, chroma) and shallow models, which struggle to capture high-level semantic and temporal dependencies in complex music. This paper proposes a novel framework for intelligent music segmentation and structure analysis leveraging self-supervised audio representation learning. First, we pre-train a Transformer-based audio encoder on a large unlabeled music corpus via masked audio modeling (MAM) to learn general-purpose, semantically rich audio representations without labeled segmentation data. Then, we design a dual-branch structure analysis network: a segment boundary detection branch using a dilated convolutional neural network (DCNN) to locate segment boundaries, and a structural similarity clustering branch using contrastive learning to group segments with consistent semantic content. We further introduce a structural entropy-based optimization module to refine hierarchical structure trees, with the objective function formulated to balance boundary precision and structural consistency. Extensive experiments on three standard MSA datasets (RWC Pop, SALAMI, Beatles) demonstrate that our method outperforms state-of-the-art baselines by 6.2% to 9.5% in F1-score for boundary detection and 5.8% to 8.3% in normalized mutual information (NMI) for structural clustering. Visualization results via t-SNE confirm that self-supervised representations capture meaningful musical structure, enabling robust cross-genre music analysis.
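
To make the clustering metric in the abstract concrete, here is a minimal sketch of normalized mutual information between two segment labelings. It assumes labels are compared frame by frame and that mutual information is normalized by the arithmetic mean of the two entropies; the abstract does not say which NMI variant or label granularity the authors used, and the function names and example labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(reference, estimated):
    """NMI between two equal-length label sequences.

    Normalization by the arithmetic mean of the entropies is an assumed
    choice; other variants (geometric mean, max) give different values.
    """
    if len(reference) != len(estimated) or not reference:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(reference)
    joint = Counter(zip(reference, estimated))
    ref_counts = Counter(reference)
    est_counts = Counter(estimated)
    mutual_info = sum(
        (c / n) * math.log(c * n / (ref_counts[a] * est_counts[b]))
        for (a, b), c in joint.items()
    )
    mean_entropy = (entropy(reference) + entropy(estimated)) / 2
    # Both labelings consist of a single cluster: treat them as identical.
    return mutual_info / mean_entropy if mean_entropy > 0 else 1.0


# Hypothetical frame labels: the estimate splits the repeated "A" section
# into two different clusters ("x" and "z").
reference = list("AAAABBBBAAAA")
estimated = list("xxxxyyyyzzzz")
print(normalized_mutual_information(reference, estimated))
```

A labeling that matches the reference up to renaming scores 1, while a statistically independent labeling scores 0. In the example, the estimate is penalized for failing to recognize that the first and last sections repeat.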

Cite This Study

Juan Du studied this question.

synapsesocial.com/papers/6a323dd7d50b63ecad207425 https://doi.org/10.6180/jase.202609_32.068
