The ability to virtually try on clothing items has become an increasingly important feature for e-commerce and online shopping experiences. Real-time virtual try-on remains challenging because existing methods force a trade-off between speed and quality: GAN-based approaches achieve high visual fidelity but at low frame rates, while faster methods sacrifice realism. HQ-RTVF is a diffusion-based framework that resolves this trade-off through three architectural innovations: running the diffusion U-Net entirely in the VAE’s compressed latent space (64×64×4 instead of 512×512×3), limiting denoising to 20 steps with FP16 mixed-precision computation, and parallelizing pose estimation and garment encoding to eliminate sequential bottlenecks. The system uses DensePose and DeepLabv3+ for body pose and segmentation, a CLIP-based garment encoder for fine-grained fabric representation, and an attention-guided fusion decoder that maintains temporal coherence across video frames, which distinguishes it from static-image methods such as VITON-HD and HR-VITON. An adaptive masking mechanism handles diverse garment types, from cropped tops to full-length dresses. Evaluated on the VITON-HD and DressCode datasets, HQ-RTVF achieves an SSIM of 0.950 and an LPIPS of 0.067 while operating in real time with only 4.2 GB of GPU memory.
Kachbal et al. (Thu,) studied this question.
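
The latent-space figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below only counts tensor elements for the two stated shapes and compares FP32 with FP16 storage for a single latent. The helper names (`num_elements`, `tensor_bytes`) are illustrative and do not come from the paper, and the numbers describe one tensor's size, not the measured speed or memory use of HQ-RTVF.

```python
# Shapes quoted in the abstract (height, width, channels).
PIXEL_SHAPE = (512, 512, 3)   # RGB image space
LATENT_SHAPE = (64, 64, 4)    # VAE latent space


def num_elements(shape):
    """Return the number of scalar values in a tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count


def tensor_bytes(shape, bytes_per_element):
    """Return the storage size of one tensor at a given precision."""
    return num_elements(shape) * bytes_per_element


pixel_elems = num_elements(PIXEL_SHAPE)
latent_elems = num_elements(LATENT_SHAPE)

# How many times fewer values the U-Net processes per denoising step.
element_reduction = pixel_elems / latent_elems

# Spatial downsampling factor per side implied by 512 -> 64.
spatial_factor = PIXEL_SHAPE[0] // LATENT_SHAPE[0]

# One latent stored in FP32 (4 bytes) versus FP16 (2 bytes).
latent_fp32_bytes = tensor_bytes(LATENT_SHAPE, 4)
latent_fp16_bytes = tensor_bytes(LATENT_SHAPE, 2)

if __name__ == "__main__":
    print(f"pixel elements:    {pixel_elems}")
    print(f"latent elements:   {latent_elems}")
    print(f"element reduction: {element_reduction:.0f}x")
    print(f"spatial factor:    {spatial_factor}x per side")
    print(f"latent FP32 bytes: {latent_fp32_bytes}")
    print(f"latent FP16 bytes: {latent_fp16_bytes}")
```

Under these shapes, each denoising step operates on 48 times fewer values than it would in pixel space, and FP16 halves the storage of each latent again. Actual U-Net cost does not scale strictly linearly with element count, so this is an upper-level intuition, not a performance estimate.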