Maintaining macro-structural coherence and preventing logical drift over long sequences remains a central challenge in symbolic music generation. Traditional autoregressive models capture contextual dependencies only implicitly, through learned statistical regularities, and therefore often fail to retain the opening musical motifs across hundreds of time steps. To address this limitation, this paper proposes CAST, a framework for long-range music generation based on explicit skeleton guidance. The method decouples sequence generation into two subprocesses: macro-harmonic planning and micro-texture filling. Specifically, MusicBERT extracts deep semantic skeletons, and a cross-attention mechanism establishes a dynamic mapping between these skeletons and the MuseFormer generator, so that long-range dependencies are modeled explicitly rather than left to implicit learning. For validation, we use MuseFormer, which represents the state of the art in sparse attention mechanisms, as the baseline, and design experiments to compare explicit structural constraints against purely implicit learning. In quantitative evaluations on generated sequences of up to 1000 tokens, CAST reduces the structural error from 0.58 (baseline) to 0.22 and raises chord generation accuracy to 96%. These results indicate that explicit skeleton guidance effectively addresses the logical collapse problem in long-sequence generation. Furthermore, mechanism analysis reveals pronounced functional backtracking patterns within the model, which suggests that the method guides the model to acquire deep harmonic grammar and thereby to generate complex musical works that are structurally rigorous and stylistically unified.
Yang Yalan (Tue,) studied this question.
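To make the skeleton-to-generator mapping concrete, the following is a minimal sketch of single-head scaled dot-product cross-attention, in which generator hidden states act as queries and skeleton embeddings act as keys and values. It is an illustration under stated assumptions, not the CAST implementation: the function names, the toy vectors, and the omission of learned projection matrices are all hypothetical choices made for brevity, and neither MusicBERT nor MuseFormer is actually invoked.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: one vector per generator position (hypothetical hidden states)
    keys, values: one vector per skeleton position (hypothetical embeddings)
    Returns (contexts, weights), where weights[i][j] is how strongly
    generator position i attends to skeleton position j.
    Assumption: learned query/key/value projections are omitted.
    """
    dim = len(keys[0])
    contexts, weights = [], []
    for q in queries:
        # Similarity between this generator state and every skeleton position.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        row = softmax(scores)
        # Context vector: attention-weighted mix of the skeleton values.
        ctx = [
            sum(w * v[c] for w, v in zip(row, values))
            for c in range(len(values[0]))
        ]
        contexts.append(ctx)
        weights.append(row)
    return contexts, weights


# Toy data, purely illustrative: three skeleton positions, two generator positions.
skeleton = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [[2.0, 0.0], [0.0, 2.0]]

contexts, weights = cross_attention(hidden, skeleton, skeleton)
```

In this sketch every generated position can attend to any skeleton position regardless of distance, which is the sense in which the long-range dependency is explicit: the attention weights form a directly inspectable map from output tokens back to the harmonic plan.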