As Virtual Try-On (VTON) technology matures, 2D VTON methods based on diffusion models can now rapidly generate diverse and high-quality try-on results. However, with rising user demands for realism and immersion, many applications are shifting towards 3D VTON, which offers superior geometric and spatial consistency. Existing 3D VTON approaches commonly face challenges such as barriers to practical deployment, substantial memory requirements, and cross-view inconsistencies. To address these issues, we propose an efficient 3D VTON framework with robust multi-view consistency, whose core design is to decouple the monolithic 3D editing task into a four-stage cascade as follows: (1) We first reconstruct an initial 3D scene using 3D Gaussian Splatting, integrating the SMPL-X model at this stage as a strong geometric prior. By computing a normal-map loss and a geometric consistency loss, we ensure the structural stability of the initial human model across different views. (2) We employ the lightweight CatVTON to generate 2D try-on images, that provide visual guidance for the subsequent personalized fine-tuning tasks. (3) To accurately represent garment details from all angles, we partition the 2D dataset into three subsets—front, side, and back—and train a dedicated LoRA module for each subset on a pre-trained diffusion model. This strategy effectively mitigates the issue of blurred details that can occur when a single model attempts to learn global features. (4) An iterative optimization process then uses the generated 2D VTON images and specialized LoRA modules to edit the 3DGS scene, achieving 360-degree free-viewpoint VTON results. All our experiments were conducted on a single consumer-grade GPU with 24 GB of memory, a significant reduction from the 32 GB or more typically required by previous studies under similar data and parameter settings. Our method balances quality and memory requirement, significantly lowering the adoption barrier for 3D VTON technology.
Wang et al. (Tue,) studied this question.