August 15, 2024

Open Access

FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

Abstract

Synthesizing motion-rich and temporally consistent videos remains a challenge in artificial intelligence, especially over extended durations. Existing text-to-video (T2V) models commonly employ spatial cross-attention for text control, which applies the same textual guidance to every frame rather than frame-specific guidance. This restricts the model's capacity to comprehend the temporal logic conveyed in prompts and to generate videos with coherent motion. To tackle this limitation, we introduce FancyVideo, a video generator that improves the existing text-control mechanism with a Cross-frame Textual Guidance Module (CTGM). Specifically, CTGM incorporates the Temporal Information Injector (TII), Temporal Affinity Refiner (TAR), and Temporal Feature Booster (TFB) at the beginning, middle, and end of cross-attention, respectively, to achieve frame-specific textual guidance. First, TII injects frame-specific information from latent features into text conditions, thereby obtaining cross-frame textual conditions. Then, TAR refines the correlation matrix between cross-frame textual conditions and latent features along the time dimension. Lastly, TFB boosts the temporal consistency of latent features. Extensive experiments comprising both quantitative and qualitative evaluations demonstrate the effectiveness of FancyVideo. Our video demo, code and model are available at https://360cvgroup.github.io/FancyVideo/.
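
To make the three stages named in the abstract more concrete, the following is a minimal NumPy sketch of where they could sit in a cross-attention step. It is not the authors' implementation: the function names, the tensor shapes, the use of plain unparameterized attention for every stage, and the residual connections are all assumptions made for illustration. The actual CTGM uses learned layers whose details are not given in the abstract.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(x):
    """Unparameterized self-attention over the frame axis.

    x has shape (F, N, C): frames, tokens, channels. Each token position
    attends across the F frames. Hypothetical stand-in for a learned layer.
    """
    xt = np.moveaxis(x, 0, -2)                                  # (N, F, C)
    scores = xt @ np.swapaxes(xt, -1, -2) / np.sqrt(x.shape[-1])  # (N, F, F)
    out = softmax(scores, axis=-1) @ xt                         # (N, F, C)
    return np.moveaxis(out, -2, 0)                              # (F, N, C)

def plain_cross_attention(z, c):
    """Baseline: every frame in z (F, N, d) sees the same text c (L, d)."""
    d = z.shape[-1]
    return softmax(z @ c.T / np.sqrt(d), axis=-1) @ c           # (F, N, d)

def inject_temporal_information(z, c):
    """TII stand-in (assumed form): text tokens attend to each frame's latents,
    giving one set of text conditions per frame, shape (F, L, d)."""
    d = z.shape[-1]
    weights = softmax(c[None] @ np.swapaxes(z, -1, -2) / np.sqrt(d), axis=-1)  # (F, L, N)
    return c[None] + weights @ z                                # (F, L, d)

def ctgm_sketch(z, c):
    """Cross-attention with stand-ins for TII (start), TAR (middle), TFB (end)."""
    d = z.shape[-1]
    c_frames = inject_temporal_information(z, c)                # TII: (F, L, d)
    affinity = z @ np.swapaxes(c_frames, -1, -2) / np.sqrt(d)   # (F, N, L)
    affinity = affinity + temporal_self_attention(affinity)     # TAR: refine along time
    attended = softmax(affinity, axis=-1) @ c_frames            # (F, N, d)
    return attended + temporal_self_attention(attended)         # TFB: mix along time

# Toy shapes (assumed): 4 frames, 6 latent tokens, 5 text tokens, 8 channels.
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 6, 8))
c = rng.normal(size=(5, 8))
print(plain_cross_attention(z, c).shape, ctgm_sketch(z, c).shape)
```

In the baseline, the text keys and values are identical for every frame, so any per-frame difference in the output comes only from the latent queries. In the sketch, the text conditions themselves become a function of the frame, which is the property the abstract calls frame-specific textual guidance.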

Authors

Jiasong Feng

Art Institute of Portland

Ao Ma

Xinjiang Agricultural University

Jing Wang

Beihua University
