What question did this study set out to answer?

This research aims to improve the transferability and generalization of weakly supervised semantic segmentation using a novel dual-graph model.

April 26, 2026

CLIP Graph Adaptor: A Dual-Graph Adapted Visual–Language Model for Weakly Supervised Semantic Segmentation

Key Points

Developed a dual-graph adaptive strategy with textual and visual subgraphs
Implemented cross-modal graph attention for effective fusion of visual and textual data
Used three specialized loss functions to guide CAM generation and superpixel consistency to refine pseudo-labels
CLIP-GA significantly improved initial class activation maps compared to previous methods
Demonstrated enhanced accuracy on PASCAL VOC 2012 and MS COCO 2014 datasets
Captured object regions more completely and reduced background activation

Abstract

Recent advances in weakly supervised semantic segmentation (WSSS) have shown promise by using the contrastive language-image pretraining (CLIP) model to generate pseudo-labels. However, directly applying the CLIP model to downstream tasks without considering interclass relationships results in suboptimal transferability and generalization. To address these challenges, we propose CLIP graph adapter (CLIP-GA), a novel approach that integrates both textual and visual structural knowledge to generate high-quality initial class activation maps (CAMs) for each object class. Our method introduces a dual-graph adaptive strategy, comprising a textual subgraph and a visual subgraph, and employs cross-modal graph attention (CGA) to fuse them. The framework includes three specialized loss functions that help capture more complete object regions while minimizing the activation of background areas closely related to foreground objects. In addition, we apply superpixel consistency to refine pseudo-labels and introduce a graph reasoning attention (GRA) module to build global contextual relationships within visual features for the segmentation network. Extensive experiments on the PASCAL VOC 2012 and MS COCO 2014 datasets demonstrate the effectiveness of CLIP-GA compared with other state-of-the-art methods. Our code is provided at: https://github.com/JIA-ZHANG666/CLIP-GA.
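
The abstract mentions superpixel consistency for refining pseudo-labels but does not spell out the procedure. One common reading of the idea is a majority vote: every pixel inside a superpixel takes the label that occurs most often in that superpixel. The sketch below illustrates only that assumption. The function name `superpixel_majority_vote`, the tie-breaking rule, and the toy inputs are hypothetical and are not taken from the paper; the authors' actual implementation is in the repository linked above.

```python
from collections import Counter, defaultdict


def superpixel_majority_vote(labels, superpixels):
    """Return a new label map where each pixel takes the majority label of its superpixel.

    `labels` and `superpixels` are 2D lists of integers with the same shape.
    This is an illustrative assumption, not the method described in the paper.
    """
    if len(labels) != len(superpixels) or any(
        len(label_row) != len(sp_row)
        for label_row, sp_row in zip(labels, superpixels)
    ):
        raise ValueError("labels and superpixels must have the same shape")

    # Count how often each label occurs inside each superpixel.
    votes = defaultdict(Counter)
    for label_row, sp_row in zip(labels, superpixels):
        for label, sp_id in zip(label_row, sp_row):
            votes[sp_id][label] += 1

    # Most frequent label per superpixel; ties go to the smallest label id.
    winner = {
        sp_id: max(counts, key=lambda lab: (counts[lab], -lab))
        for sp_id, counts in votes.items()
    }

    # Build the refined map without modifying the input.
    return [[winner[sp_id] for sp_id in sp_row] for sp_row in superpixels]


# Toy 4x4 pseudo-label map (0 = background) and a 2x2-block superpixel map.
pseudo_labels = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [2, 2, 1, 0],
    [2, 2, 2, 0],
]
superpixels = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 3, 3],
    [2, 2, 3, 3],
]
refined = superpixel_majority_vote(pseudo_labels, superpixels)
```

In this toy case the stray label 1 in the top-left superpixel is overwritten by the surrounding 0s, and the mixed bottom-right superpixel collapses to its most frequent label. In practice the superpixel map would come from a segmentation algorithm such as SLIC rather than a fixed grid.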

Cite This Study

Zhang et al. studied this question.

synapsesocial.com/papers/69edab424a46254e215b366b https://doi.org/10.1109/tnnls.2026.3683363
