While deep generative models, such as text-to-image diffusion models, demonstrate strong capabilities in synthesizing photorealistic images, they frequently produce perceptual artifacts (e.g., distorted structures or unnatural textures) that require manual correction. Existing artifact localization methods typically rely on fully supervised training with large-scale pixel-level annotations, which incurs high labeling costs. To address these challenges, we propose a novel framework based on the core insight that perceptual artifacts can be modeled as “semantic outliers”: regions that fail to match any pre-defined semantic category. Instead of learning artifact-specific features, we introduce a Mask-based Semantic Rejection (MSR) mechanism within a semantic segmentation architecture. This mechanism leverages the “one-vs-all” property of object queries to identify regions that are consistently rejected by all pre-trained semantic categories. Furthermore, we design a flexible adaptation strategy that supports both zero-shot inference using pre-trained semantic knowledge and fine-tuning with a margin-based suppression objective that explicitly optimizes the rejection boundary using minimal supervision. Comprehensive experiments across 11 synthesis tasks demonstrate that MSR significantly outperforms state-of-the-art methods, particularly in data-efficient scenarios. Specifically, the framework achieves mIoU improvements of 6.52% and 13.06% on the text-to-image task using only 10% and 50% of the labeled samples, respectively, underscoring its effectiveness when annotations are scarce.
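The "rejected by all categories" idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a query-based segmenter that outputs, for each object query, class logits over the known categories plus a trailing "no object" entry, and mask logits over flattened pixels. The function names (`rejection_scores`, `margin_suppression_loss`), the scoring rule (one minus the best per-pixel class score), and the hinge form of the margin objective are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def rejection_scores(
    class_logits: Sequence[Sequence[float]],
    mask_logits: Sequence[Sequence[float]],
) -> List[float]:
    """Per-pixel score in [0, 1]; high means no known class claims the pixel.

    class_logits: Q x (K + 1), last column assumed to be "no object".
    mask_logits:  Q x P, one mask logit per query and flattened pixel.
    """
    num_known = len(class_logits[0]) - 1
    # Keep only the probabilities of the K known semantic categories.
    class_probs = [softmax(row)[:num_known] for row in class_logits]
    mask_probs = [[sigmoid(v) for v in row] for row in mask_logits]
    num_queries = len(class_probs)
    num_pixels = len(mask_logits[0])

    scores = []
    for p in range(num_pixels):
        # Class score at pixel p: sum over queries of P(class) * P(mask).
        best = max(
            sum(class_probs[q][c] * mask_probs[q][p] for q in range(num_queries))
            for c in range(num_known)
        )
        # Assumed rule: a pixel is an outlier if even its best class is weak.
        scores.append(1.0 - min(1.0, best))
    return scores


def margin_suppression_loss(
    scores: Sequence[float],
    is_artifact: Sequence[int],
    margin: float = 0.8,
) -> float:
    """Hypothetical hinge loss: push artifact pixels above `margin`
    and clean pixels below `1 - margin`."""
    losses = []
    for score, label in zip(scores, is_artifact):
        if label:
            losses.append(max(0.0, margin - score))
        else:
            losses.append(max(0.0, score - (1.0 - margin)))
    return sum(losses) / len(losses)


# Two queries, two known classes, three pixels; no query claims pixel 2.
demo_class_logits = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]]
demo_mask_logits = [[10.0, -10.0, -10.0], [-10.0, 10.0, -10.0]]
demo_scores = rejection_scores(demo_class_logits, demo_mask_logits)
```

In this toy input the first two pixels are each claimed by one confident query, so their rejection scores stay near zero, while the third pixel is covered by no mask and scores near one. Under the stated assumptions, zero-shot use would threshold `demo_scores` directly, and the hinge loss stands in for the fine-tuning signal when a few labeled artifact masks are available.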
Zijin Yin studied this question.