To address the difficulty of deploying diffusion models on mobile devices and the subjectivity of textual style descriptions, this paper proposes an end-to-end style transfer framework built on a lightweight diffusion network and a joint image-text representation. A cross-modal feature extraction scheme based on CLIP (Contrastive Language-Image Pre-training) decouples style semantics and detail features from reference images, which avoids the ambiguity of text-only prompts. To enable real-time inference, a hybrid diffusion-GAN (Generative Adversarial Network) architecture, UFOGen, is adopted to generate images in a single step, replacing slow multi-step denoising. In addition, a lightweight network, FasterVAE (Faster Variational Autoencoder), is built from separable convolutions, transformer layers, shared key-value projections, and Swish activations, which significantly reduces the parameter count and computational cost. On a Xiaomi 14 Pro mobile device, the framework generates a 512×512 stylized image in 1.45 seconds. Experiments show that the method outperforms state-of-the-art approaches on SSIM, PSNR, and style loss, and user studies confirm its advantages in style accuracy, detail preservation, and color naturalness. This work provides a practical solution for real-time style transfer on resource-constrained platforms and advances the deployment of diffusion models on mobile devices.
Wang et al. (Thu,) studied this question.
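
To make the parameter savings concrete, the following minimal sketch compares the weight count of a standard convolution with that of a depthwise separable one, and defines the Swish activation. It assumes that the convolution referred to in the abstract is the standard depthwise separable form (a depthwise k×k convolution followed by a 1×1 pointwise convolution). The function names and the channel sizes in the usage note are hypothetical and are not taken from FasterVAE.

```python
import math


def conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k


def separable_conv_params(c_in, c_out, k):
    """Weight count of a depthwise k x k convolution followed by a
    1 x 1 pointwise convolution (bias omitted).

    Assumption: this is the usual depthwise separable factorization,
    not necessarily the exact layer used in the paper.
    """
    depthwise = c_in * k * k      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))
```

For a hypothetical layer with 256 input channels, 256 output channels, and a 3×3 kernel, `conv_params(256, 256, 3)` gives 589,824 weights while `separable_conv_params(256, 256, 3)` gives 67,840, roughly an 8.7-fold reduction for that layer. The actual savings in the full network depend on its layer configuration, which the abstract does not specify.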