What question did this study set out to answer?

The aim is to develop a real-time virtual try-on system that balances high quality and speed for e-commerce applications.

March 7, 2026 · Open Access

HQ-RTVF: High-Quality Real-Time Virtual Try-On Fitting for Diverse Clothing and Body Morphologies

Key Points

Utilized a diffusion-based framework operating in a compressed latent space.
Employed FP16 mixed-precision computation for efficient denoising.
Parallelized pose estimation and garment encoding to improve speed.
Integrated DensePose and DeepLabv3+ for accurate body pose and segmentation.
Used an attention-guided fusion decoder to ensure visual consistency across video frames.
Achieved a Structural Similarity Index (SSIM) of 0.950, indicating high visual quality.
Recorded a Learned Perceptual Image Patch Similarity (LPIPS) of 0.067, demonstrating low perceptual distance.
Operated in real time with GPU memory usage of only 4.2 GB.
Evaluation on the VITON-HD and DressCode datasets showed significant improvements over static methods.
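The parallelization point above can be illustrated with a minimal sketch. It assumes the pose stage and the garment-encoding stage do not depend on each other's output, so they can be dispatched at the same time. The functions `estimate_pose` and `encode_garment` are hypothetical stand-ins that only sleep to simulate latency; they are not the paper's models, and a real GPU pipeline would more likely rely on CUDA streams or separate processes than on Python threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def estimate_pose(person_image):
    """Hypothetical stand-in for the pose/segmentation stage."""
    time.sleep(0.05)  # simulated model latency, not a measured value
    return {"pose": f"pose({person_image})"}


def encode_garment(garment_image):
    """Hypothetical stand-in for the garment encoder."""
    time.sleep(0.05)  # simulated model latency, not a measured value
    return {"garment": f"embedding({garment_image})"}


def run_sequential(person_image, garment_image):
    # Baseline: the second stage waits for the first to finish.
    pose = estimate_pose(person_image)
    garment = encode_garment(garment_image)
    return pose, garment


def run_parallel(person_image, garment_image):
    # Both stages are submitted before either result is awaited,
    # so their latencies can overlap instead of adding up.
    with ThreadPoolExecutor(max_workers=2) as pool:
        pose_future = pool.submit(estimate_pose, person_image)
        garment_future = pool.submit(encode_garment, garment_image)
        return pose_future.result(), garment_future.result()


if __name__ == "__main__":
    print(run_parallel("person.png", "shirt.png"))
```

Both functions return the same results; only the scheduling differs. The benefit in practice depends on whether the two stages actually compete for the same compute resources.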

Abstract

The ability to virtually try on clothing items has become an increasingly important feature for e-commerce and online shopping experiences. Real-time virtual try-on remains challenging because existing methods force a trade-off between speed and quality: GAN-based approaches achieve high visual fidelity but at low frame rates, while faster methods sacrifice realism. HQ-RTVF is a diffusion-based framework that resolves this trade-off through three architectural innovations: running the diffusion U-Net entirely in the VAE’s compressed latent space (64×64×4 instead of 512×512×3), limiting denoising to 20 steps with FP16 mixed-precision computation, and parallelizing pose estimation and garment encoding to eliminate sequential bottlenecks. The system uses DensePose and DeepLabv3+ for body pose and segmentation, a CLIP-based garment encoder for fine-grained fabric representation, and an attention-guided fusion decoder that maintains temporal coherence across video frames, which distinguishes it from static image methods such as VITON-HD and HR-VITON. An adaptive masking mechanism handles diverse garment types, from cropped tops to full-length dresses. Evaluated on the VITON-HD and DressCode datasets, HQ-RTVF achieves an SSIM of 0.950 and an LPIPS of 0.067 while operating in real time with only 4.2 GB of GPU memory.
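To make the latent-space figures in the abstract concrete, the following sketch works out the arithmetic implied by the two tensor shapes it quotes. It counts elements only. The byte sizes assume 2 bytes per FP16 value and 4 bytes per FP32 value, and the variable names are illustrative rather than taken from the paper. The actual speed-up of the U-Net depends on its architecture and is not derived here.

```python
from math import prod

# Shapes quoted in the abstract: RGB image vs. VAE latent.
pixel_shape = (512, 512, 3)
latent_shape = (64, 64, 4)

pixel_values = prod(pixel_shape)    # 786432 values per image
latent_values = prod(latent_shape)  # 16384 values per latent

# Ratio of values processed per denoising step.
reduction = pixel_values / latent_values  # 48.0

# Downsampling factor along each spatial axis.
spatial_downsample = pixel_shape[0] // latent_shape[0]  # 8

# Storage for a single latent tensor at each precision.
fp16_bytes = latent_values * 2  # 32768 bytes (32 KiB)
fp32_bytes = latent_values * 4  # 65536 bytes (64 KiB)

if __name__ == "__main__":
    print(f"{reduction:.0f}x fewer values, {spatial_downsample}x spatial downsampling")
```

Under these assumptions the latent holds 48 times fewer values than the image, and storing it in FP16 halves its footprint relative to FP32.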


Cite This Study

Kachbal et al. (Thu,) studied this question.

synapsesocial.com/papers/69abc1a65af8044f7a4ea85b https://doi.org/10.14569/ijacsa.2026.0170291
