With the rapid advancement of digital media, video content has become a primary medium for information dissemination and entertainment. Background music plays a crucial role in enhancing emotional immersion and guiding narrative rhythm. However, generating high-quality music for long videos remains challenging due to complex scene transitions and diverse emotional tones. This paper introduces a novel hierarchical multi-agent framework that generates semantically consistent, temporally aligned, and stylistically coherent music for long videos. Our approach integrates storyboard-based semantic structuring, a dual-path feature fusion mechanism, and a closed-loop self-correction strategy. Extensive experiments on benchmark datasets demonstrate that our method outperforms state-of-the-art approaches in audio quality, semantic consistency, and temporal alignment, setting a new standard for automatic music generation in long videos. Code is released via https://github.com/Zyp994/MAC.
Yipin Zhao (Tue,) studied this question.