What question did this study set out to answer?

The aim is to enhance video generation quality and coherence using a controllable multimodal fusion architecture.

April 24, 2026 (Open Access)

AIGC-Driven Short Video Generation Based on the Controllable Multimodal Fusion Architecture

Key Points

Developed a multimodal fusion architecture for AIGC-driven video production.
Implemented scene coherence schemes with graph-based constraints integrating text, images, and audio.
Utilized a fine-grained adjustment mechanism for consistency and style uniformity.
Improved overall generation quality of AIGC-driven videos.
Enhanced user controllability of video content.
Achieved better coherence across different video segments.

Abstract

The utilization of Artificial Intelligence-Generated Content (AIGC) has attracted widespread attention in video content creation. To generate high-quality videos, this paper presents a controllable multimodal fusion architecture for AIGC-driven short-video production. This architecture employs hierarchical constraint mechanisms and a multimodal attention fusion mechanism to enhance video content coherence and user controllability. Specifically, a scene coherence scheme is first designed to construct graph-based global and transition-level constraints by integrating text descriptions, reference images, and audio features. By leveraging the extracted style vector data, preliminary video clips are then generated through a combination of the cross-modal fusion unit and the spatio-temporal consistency unit. Finally, a fine-grained adjustment mechanism is implemented to ensure logical consistency and stylistic uniformity in the AIGC-generated videos. Experimental results indicate that the proposed architecture improves generation quality, controllability, and cross-segment coherence under the adopted evaluation settings.
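The abstract does not specify how the multimodal attention fusion mechanism is computed. As a minimal sketch of the general idea only, the following assumes that each modality (text, image, audio) has already been encoded into a vector of the same dimension and that a style vector acts as the attention query. The function name `fuse_modalities`, the toy feature values, and the use of scaled dot-product attention are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_modalities(query, features):
    """Hypothetical attention-weighted fusion of per-modality vectors.

    query:    style/query vector (list of floats), an assumed input
    features: dict mapping modality name -> vector of the same length
    Returns the fused vector and the attention weight per modality.
    """
    names = list(features)
    scale = math.sqrt(len(query))  # scaled dot-product attention
    scores = [dot(query, features[name]) / scale for name in names]
    weights = softmax(scores)
    fused = [
        sum(w * features[name][i] for w, name in zip(weights, names))
        for i in range(len(query))
    ]
    return fused, dict(zip(names, weights))


# Toy, made-up feature vectors; real encoders would produce these.
features = {
    "text": [1.0, 0.0, 0.0, 0.0],
    "image": [0.0, 1.0, 0.0, 0.0],
    "audio": [0.0, 0.0, 1.0, 0.0],
}
style_query = [2.0, 0.0, 0.0, 0.0]  # assumed style vector aligned with text

fused, weights = fuse_modalities(style_query, features)
```

In this toy setup the query is aligned with the text vector, so text receives the largest weight while image and audio share the remainder equally. A learned model would replace the fixed vectors with trained projections, which this sketch leaves out.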

Cite This Study

Zhu et al. studied this question.

synapsesocial.com/papers/69eb0a2e553a5433e34b44e0 https://doi.org/10.3390/electronics15091783
