What question did this study set out to answer?

The aim is to improve pixel-level semantic segmentation using weak image-level labels while tackling challenges like incomplete object activation.

March 13, 2026 | Open Access

DAGNet: dual-axis attention and grouped normalization for enhanced CLIP-based weakly supervised semantic segmentation

Key Points

Introduced a dual-axis attention fusion module for better feature representation.
Developed a grouped spatial normalization module to improve context awareness.
Employed a collaborative optimization strategy to stabilize training and reduce pseudo-label noise.
Evaluated on the PASCAL VOC 2012 and MS COCO 2014 datasets.
Achieved state-of-the-art performance in weakly supervised semantic segmentation without additional supervision.
Improved mean Intersection over Union (mIoU) by 2.0% on PASCAL VOC 2012 over WeCLIP.
Improved mIoU by 1.0% on MS COCO 2014 over WeCLIP.

Abstract

Weakly Supervised Semantic Segmentation (WSSS) aims to achieve pixel-level scene understanding using coarse-grained annotations such as image-level labels, thereby reducing the reliance on expensive pixel-level supervision. However, existing methods still suffer from incomplete object activation and background confusion. To address these issues, this paper proposes the Dual-Axis and Group-normalized Network (DAGNet), an end-to-end framework based on Contrastive Language-Image Pretraining (CLIP) that enhances feature representation and pseudo-label quality. DAGNet integrates two core modules: the Dual-Axis Attention Fusion Module (DAAF), which achieves semantically consistent feature fusion through adaptive modeling of channel-spatial dual-axis attention; and the Grouped Spatial Normalization Module (GSN), which optimizes spatial saliency and enhances fine-grained context awareness. In addition, the paper introduces a collaborative optimization strategy that stabilizes training and suppresses pseudo-label noise. Extensive experiments show that DAGNet achieves state-of-the-art performance without additional supervision, improving mIoU by 2.0% on PASCAL VOC 2012 and by 1.0% on MS COCO 2014 compared to Weakly-supervised Semantic Segmentation with CLIP (WeCLIP), which validates the effectiveness and robustness of the proposed method. The code is available at https://github.com/xm24080854037-eng/DAGNet.git
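
For readers unfamiliar with the reported metric, mIoU averages the per-class ratio of correctly labeled pixels (intersection) to all pixels assigned to that class by either the prediction or the ground truth (union). The sketch below is a minimal illustration for a single pair of label maps and is not taken from the DAGNet code. The function name `mean_iou`, the nested-list input format, and the use of 255 as an ignored "void" label are assumptions made for this example; benchmark evaluations typically accumulate the counts over the whole dataset before averaging.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU for one pair of label maps (hypothetical helper).

    pred and target are equally sized 2D lists of integer class ids.
    Pixels whose target equals ignore_index are skipped (assumed convention).
    Classes that appear in neither map are left out of the average.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if t == ignore_index:
                continue
            if p == t:
                # Correct pixel: counts once toward both intersection and union.
                intersection[p] += 1
                union[p] += 1
            else:
                # Wrong pixel: enlarges the union of both classes involved.
                union[p] += 1
                union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: 2x3 label maps with three classes and one ignored pixel.
pred = [[0, 0, 1],
        [1, 1, 2]]
target = [[0, 1, 1],
          [1, 1, 255]]
print(mean_iou(pred, target, num_classes=3))
```

In this toy case class 0 has IoU 1/2 and class 1 has IoU 3/4, while class 2 is excluded because its only predicted pixel falls on an ignored target.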
