March 28, 2024 | Open Access

Achieving High-Quality Text and Audio-to-Image Generation in a Single Step

Abstract

Diffusion models have significantly advanced text-to-image generation by producing high-quality and imaginative results. However, their multi-step sampling process is slow, requiring many inference steps to achieve satisfactory outcomes. Despite attempts to improve sampling speed and computational efficiency through distillation, a functional one-step model has remained elusive. In this study, we investigate Rectified Flow, a recent method primarily applied to small datasets, as a potential solution. Central to Rectified Flow is its reflow procedure, which optimizes probability flow trajectories, refines the noise-to-image mapping, and enables effective distillation with student models. We introduce a novel text-conditioned pipeline to convert Stable Diffusion (SD) into an ultra-fast one-step model. Our approach underscores the crucial role of reflow in improving noise-to-image assignments. Leveraging this pipeline, we develop the first one-step diffusion-based text-to-image generator capable of producing high-quality images comparable to those generated by SD. Additionally, we extend our methodology to audio inputs, demonstrating that it can generate images from audio cues with high fidelity and speed.

Key Words: Stable Diffusion, Text-to-Image Creation, Image Processing
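The role of reflow described in the abstract can be illustrated with a toy sketch. The example below is not the paper's pipeline: it assumes a hypothetical one-dimensional Gaussian "dataset" (the constants `MU` and `SIGMA`), for which the Rectified Flow velocity has a closed form, and uses a linear least-squares fit in place of a trained network. All function and variable names are illustrative. It shows that a single Euler step along the initial, curved flow collapses every sample to the mean, while a single step along the reflowed, straight flow reproduces the many-step result.

```python
import numpy as np

MU, SIGMA = 3.0, 0.5  # assumed toy "data" distribution N(MU, SIGMA^2)

def velocity(x, t):
    """Closed-form velocity E[X1 - X0 | X_t = x] for independent
    X0 ~ N(0, 1), X1 ~ N(MU, SIGMA^2), with X_t = (1 - t) * X0 + t * X1."""
    var_t = (1.0 - t) ** 2 + (t * SIGMA) ** 2
    slope = (t * SIGMA**2 - (1.0 - t)) / var_t
    return MU + slope * (x - t * MU)

def euler_sample(v, x0, n_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with n_steps Euler steps."""
    x, dt = x0.copy(), 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * v(x, k * dt)
    return x

rng = np.random.default_rng(0)
noise = rng.standard_normal(10_000)

one_step = euler_sample(velocity, noise, 1)     # curved paths: one step is inaccurate
many_step = euler_sample(velocity, noise, 200)  # slow reference solution

# Reflow: pair each noise sample with the endpoint the flow assigns to it,
# then refit on straight lines between the pairs. In this toy case the
# pairing is affine, so a degree-1 least-squares fit recovers it.
a, b = np.polyfit(noise, many_step, 1)

def reflowed_velocity(x, t):
    """Velocity along the straight path x_t = (1 - t) * x0 + t * (a * x0 + b)."""
    x0 = (x - t * b) / (1.0 - t + t * a)
    return (a * x0 + b) - x0

one_step_reflow = euler_sample(reflowed_velocity, noise, 1)

print(f"1 step, before reflow: std = {one_step.std():.3f}")
print(f"200 steps:             std = {many_step.std():.3f}")
print(f"1 step, after reflow:  std = {one_step_reflow.std():.3f}")
```

In this sketch the reflowed velocity is constant along each trajectory, so the number of Euler steps no longer matters. That is the property that makes one-step distillation feasible in principle; whether it carries over to a large text-conditioned model is the question the paper addresses.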

Cite This Study

Satish Karanjekar studied this question.

synapsesocial.com/papers/68e720ceb6db64358769a123 https://doi.org/10.55041/ijsrem29789
