Weakly supervised semantic segmentation (WSSS) aims to learn pixel-level semantic concepts from image-level class labels. Because they are simple and efficient to train, end-to-end WSSS approaches have attracted significant attention from the research community. However, the coarseness of pseudo-label regions remains one of the primary bottlenecks limiting the performance of such methods. To address this issue, we propose class-guided enhanced pseudo-labeling (CEP), a method designed to generate high-quality pseudo-labels for end-to-end WSSS frameworks. Our approach leverages pretrained foundation models, namely Contrastive Language-Image Pre-training (CLIP) and the Segment Anything Model (SAM), to enhance pseudo-label quality. Specifically, following the pseudo-label generation pipeline, we introduce two key components: a Skip-CAM module and a pseudo-label refinement module. The Skip-CAM module enriches feature representations by introducing skip connections from multiple blocks of CLIP, thereby improving the quality of the localization maps. The refinement module then uses SAM to refine and correct the pseudo-labels based on the initial class-specific regions derived from the localization maps. Experimental results demonstrate that our method surpasses state-of-the-art end-to-end approaches as well as many multi-stage competitors.
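The abstract does not state the exact rule the refinement module uses to correct pseudo-labels. As an illustration only, the sketch below shows one plausible rule, majority voting within segments: each class-agnostic segment (of the kind a model such as SAM could supply) takes the dominant class of the coarse pseudo-labels it covers. The function name `refine_with_segments`, the `IGNORE` value, the `min_agreement` threshold, and the list-of-masks input format are all assumptions made for this example, not details of CEP.

```python
from collections import Counter

IGNORE = 255  # assumed label for pixels without a confident class


def refine_with_segments(coarse, segments, min_agreement=0.5):
    """Relabel each segment with the majority class of the coarse labels inside it.

    coarse:   2D list of int class ids; IGNORE marks uncertain pixels.
    segments: list of 2D boolean masks with the same shape as `coarse`
              (hypothetical stand-in for the output of a segmenter such as SAM).
    min_agreement: fraction of a segment's pixels that must carry the
              majority class before the segment is relabeled (assumed rule).
    """
    refined = [row[:] for row in coarse]  # copy so the input stays untouched
    for mask in segments:
        pixels = [(r, c)
                  for r, row in enumerate(mask)
                  for c, inside in enumerate(row) if inside]
        # Count only confident coarse labels inside the segment.
        votes = Counter(coarse[r][c] for r, c in pixels
                        if coarse[r][c] != IGNORE)
        if not votes:
            continue  # nothing confident to propagate
        label, count = votes.most_common(1)[0]
        if count / len(pixels) >= min_agreement:
            for r, c in pixels:
                refined[r][c] = label
    return refined


coarse = [
    [1, 1, 1, 0],
    [1, IGNORE, IGNORE, 0],
    [0, 0, 0, 0],
]
# One segment covering the left 2x3 block, where class 1 dominates.
segment = [
    [True, True, True, False],
    [True, True, True, False],
    [False, False, False, False],
]
refined = refine_with_segments(coarse, [segment])
```

In this toy input, four of the six pixels in the segment carry class 1, so the two uncertain pixels inherit that class. Raising `min_agreement` makes the rule more conservative: segments without a sufficiently dominant class are left unchanged.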
Zhou et al. (Fri,) studied this question.