Accurate 3-D reconstruction of garments from a single consumer-grade image remains a critical barrier to truly immersive and resource-aware virtual try-on systems. We introduce a self-supervised, multimodal pipeline that fuses visual tokens extracted by a Vision Transformer with textual garment descriptors to synthesise high-fidelity cloth geometry and texture while operating within the stringent power envelope of mobile neural-processing units (NPUs). A hybrid latent-diffusion module generates pseudo-meshes that supervise a lightweight INT8-quantised Mesh-Autoencoder, thereby eliminating the dependence on large annotated 3-D-scan corpora. To compensate for limited real data, we construct SyntheCloth-300K, a dataset blending CLO-3D captures with PhysX-driven synthetic variations, and use it for joint visual–textual training. On the DeepFashion3D benchmark, our method reduces Chamfer distance by 18% and improves SSIM by 0.03 over DressCode-NeRF, while sustaining 21 FPS at 0.32 mJ vertex⁻¹ on a Snapdragon 8 Gen 3, tripling the energy efficiency of prior art. Qualitative results reveal robust reconstruction of fine pleats and fabric drape, even under severe self-occlusion. The proposed framework thus bridges computer vision, physically based graphics, and embedded optimisation, laying the groundwork for next-generation, on-device virtual fitting applications.
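For readers unfamiliar with the geometry metric quoted above, the following is a minimal sketch of a symmetric Chamfer distance between two point sets sampled from meshes. It assumes one common convention (mean squared nearest-neighbour distance, summed over both directions); the abstract does not state which variant the authors use, and the function names `chamfer_distance` and `relative_reduction` are hypothetical, chosen here for illustration only.

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty point sets.

    a, b: sequences of equal-dimension coordinate tuples, e.g. (x, y, z).
    Convention assumed here: mean squared nearest-neighbour distance,
    summed over both directions. Other variants (unsquared, averaged)
    exist and give different absolute values.
    """
    def one_way(src, dst):
        # For each source point, find the squared distance to its
        # nearest neighbour in dst, then average over the source set.
        total = 0.0
        for s in src:
            total += min(
                sum((p - q) ** 2 for p, q in zip(s, d)) for d in dst
            )
        return total / len(src)

    return one_way(a, b) + one_way(b, a)


def relative_reduction(baseline, ours):
    """Fractional reduction of a lower-is-better metric against a baseline."""
    return (baseline - ours) / baseline


if __name__ == "__main__":
    # Toy point sets, not data from the paper.
    predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
    print(chamfer_distance(predicted, reference))
```

This brute-force version costs O(|a|·|b|) and is meant only to make the definition concrete. For mesh-scale point sets, a spatial index such as a KD-tree is the usual choice for the nearest-neighbour search. Under this reading, an 18% reduction means the method's Chamfer distance is 0.82 times the baseline's.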
Chekhmestruk et al. (Sat,) studied this question.