This presentation proposes a cascaded symbolic music generation method based on bar-wise feature sequence modeling, with the aim of better modeling global musical structure. The proposed framework consists of three modules: (i) a bar-wise feature extractor, which extracts a feature vector from each bar of existing music, (ii) a bar-wise composer, which generates symbolic music for each bar conditioned on the corresponding bar-wise feature, and (iii) a bar-wise feature sequence generator, which models the temporal dependencies among bar-wise features to capture global musical structure. By explicitly separating global structure modeling from the local generation of symbolic music, the architecture enables flexible generation and makes musical structure more interpretable, since each module corresponds to a distinct musical role. We compared several training configurations, including separately trained and jointly trained modules, to investigate the effect of training strategy on generation quality. Experimental results show that the proposed method improves global structural coherence and musical naturalness in long-form compositions. However, the results also suggest that further improvement is needed in musicality and creativity. Work supported by JST AIP Acceleration Research JPMJCR25U5, Japan.
Sawada et al. (Wed,) studied this question.
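The data flow among the three modules described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the models, features, or note representation, so everything here is a hypothetical stand-in: the feature is a hand-written pair (note count, mean pitch), the feature sequence generator is a toy first-order Markov chain, the composer places evenly spaced notes around the mean pitch, and 4/4 time is assumed. The function names are illustrative and are not taken from the authors' work.

```python
import random
from typing import Dict, List, Tuple

Note = Tuple[int, float, float]  # (MIDI pitch, onset in beats, duration in beats)
Bar = List[Note]
Feature = Tuple[int, float]      # hypothetical feature: (note count, mean pitch)

BEATS_PER_BAR = 4.0  # assumption: 4/4 time


def extract_bar_feature(bar: Bar) -> Feature:
    """Module (i): map one bar to a feature vector (toy statistics, not learned)."""
    if not bar:
        return (0, 60.0)
    mean_pitch = sum(pitch for pitch, _, _ in bar) / len(bar)
    return (len(bar), mean_pitch)


def generate_feature_sequence(
    corpus: List[Bar], n_bars: int, rng: random.Random
) -> List[Feature]:
    """Module (iii): model bar-to-bar dependencies among features.

    Stand-in: a first-order Markov chain over features seen in the corpus.
    """
    feats = [extract_bar_feature(bar) for bar in corpus]
    transitions: Dict[Feature, List[Feature]] = {}
    for prev, nxt in zip(feats, feats[1:]):
        transitions.setdefault(prev, []).append(nxt)
    current = feats[0]
    sequence = [current]
    while len(sequence) < n_bars:
        # Fall back to any observed feature if the current one has no successor.
        current = rng.choice(transitions.get(current, feats))
        sequence.append(current)
    return sequence[:n_bars]


def compose_bar(feature: Feature, rng: random.Random) -> Bar:
    """Module (ii): generate notes for one bar conditioned on its feature."""
    count, mean_pitch = feature
    if count == 0:
        return []
    step = BEATS_PER_BAR / count
    center = round(mean_pitch)
    return [(center + rng.randint(-2, 2), i * step, step) for i in range(count)]


def generate_piece(corpus: List[Bar], n_bars: int, seed: int = 0) -> List[Bar]:
    """Cascade: global structure first (iii), then local notes per bar (ii)."""
    rng = random.Random(seed)
    features = generate_feature_sequence(corpus, n_bars, rng)
    return [compose_bar(feature, rng) for feature in features]


# Tiny made-up corpus of four bars, used only to exercise the sketch.
corpus = [
    [(60, 0.0, 1.0), (64, 1.0, 1.0), (67, 2.0, 1.0), (72, 3.0, 1.0)],
    [(62, 0.0, 2.0), (65, 2.0, 2.0)],
    [(60, 0.0, 1.0), (64, 1.0, 1.0), (67, 2.0, 1.0), (72, 3.0, 1.0)],
    [(59, 0.0, 4.0)],
]
piece = generate_piece(corpus, n_bars=8, seed=0)
```

Because the structure is decided before any notes are written, the feature sequence can be inspected or edited by hand and then passed to `compose_bar`, which is one way to read the flexibility and interpretability that the abstract attributes to the separation of modules.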