What question did this study set out to answer?

The aim is to generate high-quality, coherent music for long videos that is semantically consistent and aligned with the video's narrative rhythm.

May 29, 2026Open Access

Multi-agent collaboration for coherent long-video music synthesis

Key Points

The aim is to generate high-quality, coherent music for long videos that is semantically consistent and aligned with the video's narrative rhythm.
Developed a hierarchical multi-agent framework for music synthesis.
Integrated storyboard-based semantic structuring and dual-path feature fusion.
Implemented a closed-loop self-correction strategy to enhance music quality.
Outperformed state-of-the-art methods in audio quality based on benchmark datasets.
Achieved significant improvements in semantic consistency and temporal alignment of the generated music.

Abstract

With the rapid advancement of digital media, video content has become a primary medium for information dissemination and entertainment. Background music plays a crucial role in enhancing emotional immersion and guiding narrative rhythm. However, generating high-quality music for long videos remains challenging due to complex scene transitions and diverse emotional tones. This paper introduces a novel hierarchical multi-agent framework that generates semantically consistent, temporally aligned, and stylistically coherent music for long videos. Our approach integrates storyboard-based semantic structuring, a dual-path feature fusion mechanism, and a closed-loop self-correction strategy. Extensive experiments on benchmark datasets demonstrate that our method outperforms state-of-the-art approaches in audio quality, semantic consistency, and temporal alignment, setting a new standard for automatic music generation in long videos. Code is released via https://github.com/Zyp994/MAC.

Mark Helpful

Bookmark

Relay

View Full Paper