What question did this study set out to answer?

This work aims to develop a lightweight diffusion network for real-time style transfer on mobile devices, addressing two challenges: the cost of running diffusion models on such devices and the subjectivity of textual style descriptions.

April 26, 2026

Lightweight Diffusion Network for Real-Time Style Transfer on Mobile Devices with Joint Image-Text Interaction

Key Points

Proposes an end-to-end style transfer framework with a lightweight diffusion network and joint image-text representation.
Introduces a diffusion-GAN hybrid architecture (UFOGen) for single-step image generation, replacing multi-step denoising.
Develops a lightweight network (FasterVAE) that uses separated convolution, transformer layers, key-value projection sharing, and Swish activation to reduce parameters and computational cost.
Generates a 512×512 stylized image in 1.45 seconds on a Xiaomi 14 Pro device.
Outperforms state-of-the-art methods in SSIM, PSNR, and style loss metrics.
User studies confirm superior style accuracy, detail preservation, and color naturalness.
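
For readers unfamiliar with the metrics named above, PSNR can be sketched as follows. This is a minimal illustration of the standard definition, not the paper's evaluation code; the function name `psnr`, the flat-list image representation, and the default peak value of 255 (8-bit pixels) are assumptions made for this example.

```python
import math


def psnr(reference, result, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images.

    Both images are given as flat sequences of pixel values of equal,
    non-zero length (an assumption made to keep the sketch dependency-free).
    """
    if len(reference) != len(result) or len(reference) == 0:
        raise ValueError("images must be non-empty and the same size")

    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, result)) / len(reference)

    # Identical images have zero error, so PSNR is unbounded.
    if mse == 0:
        return math.inf

    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values indicate that the stylized output stays closer, pixel by pixel, to the reference it is compared against.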

Abstract

To address the challenges of deploying diffusion models on mobile devices and the subjectivity of textual style descriptions, this paper proposes an end-to-end style transfer framework based on a lightweight diffusion network and joint image-text representation. A cross-modal feature extraction scheme based on CLIP (Contrastive Language-Image Pre-training) is designed to decouple style semantics and detail features from reference images, overcoming the ambiguity of pure text prompts. To enable real-time inference, a diffusion-GAN (Generative Adversarial Network) hybrid architecture (UFOGen) is introduced to achieve single-step generation, replacing inefficient multi-step denoising. Furthermore, a lightweight network (FasterVAE, Faster Variational Autoencoder) is developed using separated convolution, transformer layers, key-value projection sharing, and Swish activation, significantly reducing parameters and computational cost. On a Xiaomi 14 Pro mobile device, the framework generates a 512×512 stylized image in 1.45 seconds. Experiments show that the method outperforms state-of-the-art approaches in SSIM, PSNR, and style loss. User studies also confirm its advantages in style accuracy, detail preservation, and color naturalness. This work provides a practical solution for real-time style transfer on resource-constrained platforms, advancing the deployment of diffusion models on mobile devices.
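
Two of the lightweight building blocks named in the abstract can be illustrated with a small sketch. It is not the FasterVAE implementation. It assumes that "separated convolution" refers to depthwise separable convolution, which the abstract does not state explicitly, and the function names and layer sizes below are hypothetical, chosen only for the example.

```python
import math


def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x)."""
    z = beta * x
    # Numerically stable sigmoid for both signs of z.
    if z >= 0:
        sig = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        sig = e / (1.0 + e)
    return x * sig


def standard_conv_params(kernel, c_in, c_out):
    """Weight count of a standard 2D convolution (bias omitted)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel, c_in, c_out):
    """Weight count of a depthwise separable convolution (bias omitted).

    Assumption: one kernel x kernel depthwise filter per input channel,
    followed by a 1x1 pointwise convolution that mixes channels.
    """
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 256 input and 256 output channels.
standard = standard_conv_params(3, 256, 256)    # 589824 weights
separable = separable_conv_params(3, 256, 256)  # 67840 weights
```

Under these assumptions the separable layer needs roughly 8.7 times fewer weights than the standard one, which shows the kind of saving such a substitution can give. The actual reduction in the paper's network depends on its real layer configuration.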
