The use of Artificial Intelligence-Generated Content (AIGC) in video content creation has attracted widespread attention. To generate high-quality videos, this paper presents a controllable multimodal fusion architecture for AIGC-driven short-video production. The architecture combines hierarchical constraint mechanisms with a multimodal attention fusion mechanism to improve video content coherence and user controllability. Specifically, a scene coherence scheme first integrates text descriptions, reference images, and audio features to construct graph-based global and transition-level constraints. Using the extracted style vectors, preliminary video clips are then generated by a cross-modal fusion unit combined with a spatio-temporal consistency unit. Finally, a fine-grained adjustment mechanism is applied to ensure logical consistency and stylistic uniformity in the generated videos. Experimental results indicate that the proposed architecture improves generation quality, controllability, and cross-segment coherence under the adopted evaluation settings.
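To make the idea of attention-based multimodal fusion concrete, the following is a minimal illustrative sketch and not the fusion unit proposed in this paper. It assumes that text, image, and audio features have already been projected to a shared dimension, and it lets text tokens attend over the concatenated image and audio features using standard scaled dot-product attention. The function names (`cross_modal_attention`, `fuse_modalities`), the feature shapes, and the residual combination are hypothetical choices made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_attention(query, context):
    """Scaled dot-product attention: each query row attends over all context rows."""
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)   # (n_query, n_context)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ context, weights


def fuse_modalities(text, image, audio):
    """Hypothetical fusion step: text tokens attend over image and audio features.

    All inputs are assumed to share the same feature dimension.
    """
    context = np.concatenate([image, audio], axis=0)
    attended, weights = cross_modal_attention(text, context)
    fused = text + attended                   # residual combination (an assumption)
    return fused, weights


# Toy features: 4 text tokens, 6 image patches, 5 audio frames, dimension 8.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
image = rng.normal(size=(6, 8))
audio = rng.normal(size=(5, 8))

fused, weights = fuse_modalities(text, image, audio)
print(fused.shape, weights.shape)
```

In this toy setting, `fused` keeps the shape of the text features, and each row of `weights` is a distribution over the 11 image and audio positions. A learned model would additionally use trainable query, key, and value projections, which are omitted here for brevity.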
Zhu et al. studied this question.