Automated vision inspection is vital in modern manufacturing, but high-yield processes produce a severe data imbalance: normal samples are abundant while defective samples are scarce. To address this, we propose CLARIS (Control-based Language-guided Realistic Imperfection Synthesis), a novel framework that combines the semantic flexibility of natural language with 3D structural constraints to generate physically consistent, high-quality defect images. CLARIS uses a Vision-Language Model (VLM) to interpret user instructions and input images, dynamically generating tailored text prompts and defect masks. ControlNet then takes normal maps as explicit constraints so that the synthesized defects adhere to the object's physical shape and surface curvature. Furthermore, Textual Inversion (TI) and Low-Rank Adaptation (LoRA) are employed to learn the unique visual characteristics of specific defects efficiently, with few trainable parameters. Evaluated on the 15 categories of the MVTec Anomaly Detection (MVTec AD) dataset, the framework achieved an average Kernel Inception Distance (KID) of 11.07, an Inception Score (IS) of 1.63, and an intra-cluster pairwise LPIPS distance (IC-LPIPS) of 0.27.
Kim et al. studied this question.
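For readers unfamiliar with the KID metric reported above, the following is a minimal sketch of its standard definition: an unbiased estimate of the squared maximum mean discrepancy (MMD) between two feature sets under a cubic polynomial kernel. The abstract does not state how the authors computed KID, so this should not be read as their evaluation code. The function names `poly_kernel` and `kid` are hypothetical, and the inputs are plain feature vectors. In practice these would typically be Inception network features of real and generated images, and the estimate is usually averaged over several subsets.

```python
from itertools import combinations


def poly_kernel(x, y):
    """Cubic polynomial kernel k(x, y) = (x . y / d + 1)^3, d = feature dimension."""
    d = len(x)
    dot = sum(a * b for a, b in zip(x, y))
    return (dot / d + 1.0) ** 3


def kid(real_feats, fake_feats):
    """Unbiased MMD^2 estimate between two lists of feature vectors.

    Assumption: each argument is a list of equal-length numeric vectors.
    """
    m, n = len(real_feats), len(fake_feats)
    if m < 2 or n < 2:
        raise ValueError("need at least two samples per set")
    # Within-set averages skip the i == j pairs, which keeps the estimate unbiased.
    k_rr = 2.0 * sum(poly_kernel(a, b) for a, b in combinations(real_feats, 2)) / (m * (m - 1))
    k_ff = 2.0 * sum(poly_kernel(a, b) for a, b in combinations(fake_feats, 2)) / (n * (n - 1))
    # Cross-set average uses every real/fake pair.
    k_rf = sum(poly_kernel(a, b) for a in real_feats for b in fake_feats) / (m * n)
    return k_rr + k_ff - 2.0 * k_rf


if __name__ == "__main__":
    # Toy 2-D "features" standing in for real and generated image embeddings.
    real = [[0.0, 0.0], [1.0, 1.0]]
    fake = [[5.0, 5.0], [6.0, 6.0]]
    print(kid(real, fake))
```

Lower values indicate that the generated feature distribution is closer to the real one. Because the estimator is unbiased, it can come out slightly negative when the two sets are nearly identical.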