June 14, 2026

VC-VTON: Towards Across-View and Multi-Posture Driven Virtual Try-On via Spatiotemporal-Aware View Consistency Training

Key Points

The objective is to develop a robust virtual try-on model that keeps garments consistent across different viewpoints and postures.
Proposes the VC-VTON task and releases a multi-view virtual try-on dataset with complete annotations, such as viewpoint, text, posture, and parsing maps.
Develops VC-TwinNet, a Twin-UNet baseline built on spatiotemporal-aware view consistency training.
Introduces a spatiotemporal-aware view attention module and an across-view consistency loss.
Reports state-of-the-art results on various evaluations without degrading single-view performance.
Demonstrates generalization across views and postures using the proposed view attention module.
Validates the improvements through extensive experiments; the components are described as essentially plug-and-play and remain effective in the DiT-centered paradigm.

Abstract

Virtual try-on (VTON) aims to synthesize fashion images of a person dressed in given garments and has great potential in real-world scenarios. Existing methods generally build on single-view VTON: they train a warping model and then fit the given garments onto the human body under a fixed posture and viewpoint. This often fails to preserve consistent garment characteristics in across-view and multi-pose guided try-on scenarios, owing to the lack of both across-view data and effective view consistency training. To alleviate this, we propose a new view consistency-driven VTON task (VC-VTON) and release a multi-view virtual try-on dataset with complete annotations (e.g., viewpoint, text, posture, and parsing maps) to encourage across-view training. Based on this hard-won dataset, we further propose VC-TwinNet, a Twin-UNet baseline based on spatiotemporal-aware view consistency training, designed specifically for this challenging task. Specifically, to enable view-aware denoising and sparse-to-continuous view generalization, we introduce RoPE and circle embedding to represent the relative and continuous position relations across viewpoints, which serve to distinguish their outfitting appearance and warping states. Then, to implicitly learn the interactions across views under multiple given posture conditions, we contribute a spatiotemporal-aware view attention module that captures spatial and temporal details for across-view training. Moreover, we use an across-view consistency loss to supervise model training and ultimately improve the performance of VC-VTON. Extensive experiments demonstrate the superiority of our approach, with state-of-the-art results on various evaluations and no decline in single-view performance. In terms of practicality and timeliness, our proposed components are essentially plug-and-play and remain effective in the new DiT-centered paradigm.
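
The abstract does not spell out how RoPE and circle embedding are applied to viewpoints, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a viewpoint is a single azimuth angle in degrees, that "circle embedding" means mapping that angle to cosine and sine features so that 0° and 360° coincide, and that RoPE rotates feature pairs by an amount proportional to the angle. The function names `circle_embedding`, `rope_rotate`, and `dot`, and the parameters `num_freqs` and `base`, are hypothetical.

```python
import math


def circle_embedding(angle_deg, num_freqs=4):
    """Map a viewpoint angle to periodic (cos, sin) features.

    Assumed form: integer harmonics of the angle, so the embedding is
    continuous in the angle and identical at 0 and 360 degrees.
    """
    theta = math.radians(angle_deg)
    feats = []
    for k in range(1, num_freqs + 1):
        feats.append(math.cos(k * theta))
        feats.append(math.sin(k * theta))
    return feats


def rope_rotate(vec, angle_deg, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles tied to the view angle.

    Uses the usual RoPE frequency schedule base**(-2i/d); treating the
    view angle as the "position" is an assumption of this sketch.
    """
    if len(vec) % 2 != 0:
        raise ValueError("vector length must be even")
    theta = math.radians(angle_deg)
    half = len(vec) // 2
    out = []
    for i in range(half):
        a = theta * base ** (-i / half)  # rotation angle for pair i
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(a) - y * math.sin(a))
        out.append(x * math.sin(a) + y * math.cos(a))
    return out


def dot(u, v):
    """Plain dot product, standing in for an attention score."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical query/key features for two views of the same garment.
q = [0.3, -1.2, 0.5, 2.0]
k = [1.0, 0.4, -0.7, 0.9]

# The score depends only on the angular offset between the two views.
score_a = dot(rope_rotate(q, 30.0), rope_rotate(k, 90.0))
score_b = dot(rope_rotate(q, 0.0), rope_rotate(k, 60.0))
print(abs(score_a - score_b) < 1e-9)
```

In this sketch the rotary part makes an attention score depend only on the relative offset between two views, while the cosine and sine features give a representation that varies smoothly with the angle and wraps around at a full turn. With the non-integer rotary frequencies used here, the rotation alone is not periodic at 360°; the periodicity comes from the circle features.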
