Weakly Supervised Semantic Segmentation (WSSS) aims to achieve pixel-level scene understanding from coarse-grained annotations such as image-level labels, thereby reducing the reliance on expensive pixel-level supervision. However, existing methods still suffer from incomplete object activation and background confusion. To address these issues, this paper proposes the Dual-Axis and Group-normalized Network (DAGNet), an end-to-end framework based on Contrastive Language-Image Pretraining (CLIP), to enhance feature representation and pseudo-label quality. DAGNet integrates two core modules: the Dual-Axis Attention Fusion Module (DAAF), which achieves semantically consistent feature fusion through adaptive modeling of channel-spatial dual-axis attention; and the Grouped Spatial Normalization Module (GSN), which optimizes spatial saliency and enhances fine-grained context awareness. Furthermore, this paper introduces a collaborative optimization strategy to stabilize the training process and suppress pseudo-label noise. Extensive experiments demonstrate that DAGNet achieves state-of-the-art performance without additional supervision, improving mIoU by 2.0% and 1.0% on the PASCAL VOC 2012 and MS COCO 2014 datasets, respectively, compared to Weakly-supervised Semantic Segmentation with CLIP (WeCLIP), validating the effectiveness and robustness of the proposed method. The code is available at https://github.com/xm24080854037-eng/DAGNet.git
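For readers unfamiliar with the reported metric, mean Intersection over Union (mIoU) can be sketched as follows. This is a minimal illustration and not the evaluation code from the DAGNet repository. The function name `mean_iou`, the nested-list mask format, and the `ignore_index=255` default are assumptions made for this example. Benchmark protocols typically accumulate intersections and unions over the whole dataset before averaging, whereas this sketch scores a single prediction.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU for one predicted mask against one ground-truth mask.

    pred and target are 2D lists of integer class ids with the same shape.
    Pixels whose target equals ignore_index are skipped (assumed convention).
    Classes absent from both masks are left out of the average.
    """
    inter = [0] * num_classes  # per-class intersection counts
    union = [0] * num_classes  # per-class union counts
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if t == ignore_index:
                continue
            if p == t:
                # Correct pixel: counts once in both intersection and union.
                inter[p] += 1
                union[p] += 1
            else:
                # Wrong pixel: enlarges the union of both classes involved.
                union[p] += 1
                union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Toy 2x2 example with two classes.
pred = [[0, 0], [1, 1]]
target = [[0, 1], [1, 1]]
score = mean_iou(pred, target, num_classes=2)
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so `score` is their mean, 7/12.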
Shao et al. studied this question.