Music structure analysis (MSA) and segmentation are fundamental tasks in music information retrieval (MIR). They aim to decompose a piece of music into semantically coherent segments (e.g., verse, chorus, bridge) and to reveal the hierarchical relationships among those segments. Traditional methods rely on handcrafted audio features (e.g., MFCCs, chroma) and shallow models, which struggle to capture high-level semantics and temporal dependencies in complex music. This paper proposes a novel framework for intelligent music segmentation and structure analysis built on self-supervised audio representation learning. First, we pre-train a Transformer-based audio encoder on a large unlabeled music corpus via masked audio modeling (MAM), so that it learns general-purpose, semantically rich audio representations without labeled segmentation data. Then, we design a dual-branch structure analysis network: a segment boundary detection branch that uses a dilated convolutional neural network (DCNN) to locate segment boundaries, and a structural similarity clustering branch that uses contrastive learning to group segments with consistent semantic content. We further introduce a structural entropy-based optimization module that refines the hierarchical structure trees, with an objective function formulated to balance boundary precision against structural consistency. Extensive experiments on three standard MSA datasets (RWC Pop, SALAMI, Beatles) show that our method outperforms state-of-the-art baselines by 6.2% to 9.5% in F1-score for boundary detection and by 5.8% to 8.3% in normalized mutual information (NMI) for structural clustering. t-SNE visualizations confirm that the self-supervised representations capture meaningful musical structure, enabling robust cross-genre music analysis.
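To make the boundary-detection F1-score mentioned above concrete, the following is a minimal sketch of a hit-rate style evaluation, in which an estimated boundary counts as correct if it falls within a tolerance window of an unmatched reference boundary. The function name `boundary_f1`, the greedy one-to-one matching, and the 0.5 s default tolerance are illustrative assumptions; the abstract does not state which tolerance or matching procedure the experiments use.

```python
def boundary_f1(reference, estimated, tolerance=0.5):
    """Hit-rate precision, recall and F1 for boundary times given in seconds.

    Assumption: a boundary is a hit if it lies within `tolerance` seconds of
    a reference boundary, and each boundary is matched at most once.
    """
    ref = sorted(reference)
    est = sorted(estimated)
    i = j = hits = 0
    # Walk both sorted lists, greedily pairing the earliest compatible boundaries.
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= tolerance:
            hits += 1
            i += 1
            j += 1
        elif est[j] < ref[i]:
            j += 1  # estimate is too early to match any remaining reference
        else:
            i += 1  # reference is too early to match any remaining estimate
    precision = hits / len(est) if est else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical boundary annotations (seconds), for illustration only.
reference = [0.0, 15.2, 42.0, 71.5]
estimated = [0.1, 15.9, 41.8, 60.0, 71.4]
precision, recall, f1 = boundary_f1(reference, estimated)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this toy example three of the five estimated boundaries fall within 0.5 s of a reference boundary, so precision is 3/5 and recall is 3/4. A wider tolerance would also count the estimate at 15.9 s as a hit, which is why reported F1-scores are only comparable when the tolerance is the same.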
Juan Du (Mon,) studied this question.