Recent advances in weakly supervised semantic segmentation (WSSS) have shown promise by using the contrastive language-image pretraining (CLIP) model to generate pseudo-labels. However, directly applying the CLIP model to downstream tasks without considering interclass relationships results in suboptimal transferability and generalization. To address these challenges, we propose CLIP graph adapter (CLIP-GA), a novel approach that integrates both textual and visual structural knowledge to generate high-quality initial class activation maps (CAMs) for each object class. Our method introduces a dual-graph adaptive strategy, comprising a textual subgraph and a visual subgraph, and employs cross-modal graph attention (CGA) to fuse them effectively. The framework includes three specialized loss functions that help capture more complete object regions while minimizing the activation of background areas closely related to foreground objects. In addition, we apply superpixel consistency to refine the pseudo-labels and introduce a graph reasoning attention (GRA) module that builds global contextual relationships within visual features for the segmentation network. Extensive experiments on the PASCAL VOC 2012 and MS COCO 2014 datasets demonstrate the effectiveness of CLIP-GA compared with other state-of-the-art methods. Our code is provided at: https://github.com/JIA-ZHANG666/CLIP-GA.
Zhang et al. studied this question.
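
The abstract does not specify how superpixel consistency is enforced when refining pseudo-labels. One common reading, assumed here purely for illustration, is majority voting: every pixel inside a superpixel takes the most frequent pseudo-label in that superpixel. The following minimal sketch shows that idea in plain Python; the function name `refine_with_superpixels` and the list-of-lists input format are hypothetical and are not taken from the CLIP-GA code base.

```python
from collections import Counter, defaultdict


def refine_with_superpixels(labels, superpixels):
    """Give every pixel the majority pseudo-label of its superpixel.

    Assumed (hypothetical) input format: two equally sized 2D lists of ints,
    where labels[r][c] is a class id and superpixels[r][c] is a segment id.
    """
    # Count how often each class label occurs inside each superpixel.
    votes = defaultdict(Counter)
    for label_row, sp_row in zip(labels, superpixels):
        for label, sp in zip(label_row, sp_row):
            votes[sp][label] += 1

    # Pick the most frequent label per superpixel.
    majority = {sp: counts.most_common(1)[0][0] for sp, counts in votes.items()}

    # Rewrite the label map so each superpixel is uniformly labeled.
    return [[majority[sp] for sp in sp_row] for sp_row in superpixels]


if __name__ == "__main__":
    noisy_labels = [[1, 1, 0],
                    [1, 0, 0]]
    segments = [[0, 0, 1],
                [0, 0, 1]]
    print(refine_with_superpixels(noisy_labels, segments))
```

In this toy input, the single mislabeled pixel in segment 0 is overruled by its three neighbors in the same segment. A real pipeline would obtain the segment map from a superpixel algorithm and might weight votes by CAM confidence; both choices are outside what the abstract states.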