Recent diffusion-based generative models employ methods such as one-shot fine-tuning of an image diffusion model for video generation. However, this approach leads to long video generation times and suboptimal efficiency. To shorten generation time, zero-shot text-to-video models eliminate fine-tuning entirely and can generate novel videos from a text prompt alone. While zero-shot generation greatly reduces generation time, many models rely on inefficient cross-frame attention processors, which hinders the use of diffusion models for real-time video generation. We address this issue by introducing more efficient attention processors to a video diffusion model. Specifically, we use attention processors (namely xFormers, FlashAttention, and HyperAttention) that are highly optimized for efficiency and hardware parallelization. We then apply these processors to a video generator and test with both older diffusion models such as Stable Diffusion 1.5 and newer, high-quality models such as Stable Diffusion XL. Our results show that using efficient attention processors alone can reduce generation time by around 25% with no change in video quality. Combined with higher-quality models, efficient attention processors in zero-shot generation deliver a substantial increase in both efficiency and quality, greatly expanding the applicability of video diffusion models to real-time video generation.
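To make the idea of a more efficient attention processor concrete, the following is a minimal, dependency-free sketch. It is not the authors' implementation and it does not call the xFormers, FlashAttention, or HyperAttention libraries. It only illustrates the tiled, online-softmax computation that FlashAttention-style kernels are built on, and it assumes a common cross-frame formulation in which every frame's queries attend to the keys and values of the first frame (the abstract does not specify this). All function names are hypothetical.

```python
import math


def naive_attention(q, k, v):
    """Reference attention: builds the full row of scores for every query."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / total
                    for d in range(len(v[0]))])
    return out


def chunked_attention(q, k, v, chunk=2):
    """Same result as naive_attention, but keys/values are consumed in blocks.

    A running max, running denominator, and running weighted sum are kept per
    query (online softmax), so the full score row is never stored at once.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    dim_v = len(v[0])
    out = []
    for qi in q:
        running_max = float("-inf")
        denom = 0.0
        acc = [0.0] * dim_v
        for start in range(0, len(k), chunk):
            k_blk = k[start:start + chunk]
            v_blk = v[start:start + chunk]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale what was accumulated under the old max (0.0 on first block).
            correction = math.exp(running_max - new_max)
            denom *= correction
            acc = [a * correction for a in acc]
            for s, vj in zip(scores, v_blk):
                w = math.exp(s - new_max)
                denom += w
                for d in range(dim_v):
                    acc[d] += w * vj[d]
            running_max = new_max
        out.append([a / denom for a in acc])
    return out


def cross_frame_attention(frames_q, frames_k, frames_v, attend=chunked_attention):
    """Assumed cross-frame scheme: every frame attends to frame 0's keys/values."""
    k0, v0 = frames_k[0], frames_v[0]
    return [attend(q, k0, v0) for q in frames_q]


# Tiny demo: two frames, two tokens each, feature size 2.
frames_q = [[[0.1, 0.2], [0.3, -0.1]], [[0.5, 0.0], [-0.2, 0.4]]]
frames_k = [[[0.2, 0.1], [-0.3, 0.6]], [[0.9, 0.9], [0.9, 0.9]]]
frames_v = [[[1.0, 0.0], [0.0, 1.0]], [[5.0, 5.0], [5.0, 5.0]]]
fast = cross_frame_attention(frames_q, frames_k, frames_v)
slow = cross_frame_attention(frames_q, frames_k, frames_v, attend=naive_attention)
print(max(abs(a - b) for f, s in zip(fast, slow)
          for rf, rs in zip(f, s) for a, b in zip(rf, rs)))
```

The two functions return the same values up to floating-point error; the difference is only in how much intermediate state is held at once. That is why swapping the attention processor can plausibly change speed and memory use without changing the generated output, consistent with the abstract's report of unchanged video quality.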
Ethan Frakes
Umar Khalid
Chen Chen
University of Central Florida
Frakes et al. (Fri,) studied this question.
www.synapsesocial.com/papers/68e65ac3b6db6435875e9753 — DOI: https://doi.org/10.1117/12.3013575