Few-shot semantic segmentation (FSS) aims to predict segmentation masks for unseen objects using only a limited number of annotated samples. Among various approaches, prototype learning has been widely adopted in FSS: prototype vectors derived from the annotated support images are matched against the features of the query images to guide the segmentation of unseen objects. Although prototype-based methods have achieved considerable progress, they still suffer from prototype bias and insufficient utilization of the limited multimodal information available. To address these issues, we propose a Multimodal-Driven Prototype Evolving Network (MDPENet), designed to enhance prototype representation and generalization. The proposed network consists of three main modules: the Support Feature Enhancement Module (SFEM), the Query Feature Disentanglement Module (QFDM), and the Prototype Evolution Module (PEM). Specifically, the SFEM establishes multimodal feature interactions between the text label features encoded by Contrastive Language-Image Pre-training (CLIP) and the separated support foreground features, thereby enhancing the representational quality and robustness of the support features. The QFDM then integrates the CLIP-encoded text label features with the support foreground features to disentangle the whole query feature, reducing semantic interference among mixed query representations. Finally, the PEM evolves and refines the prototype set at a fine-grained level using the enhanced support features and the disentangled query foreground features. Extensive experiments on the benchmark datasets PASCAL-5$^i$ and COCO-20$^i$ demonstrate the superiority of MDPENet over classical FSS methods.
Ding et al. (Fri,) studied this question.
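To make the prototype-learning idea in the abstract concrete, the following is a minimal sketch of a generic prototype-matching baseline, assuming masked average pooling over support features and cosine similarity against query features. It is not MDPENet: it contains no CLIP text features and no SFEM, QFDM, or PEM, and all function names and toy feature values are hypothetical, chosen only for illustration.

```python
import math


def masked_average_pooling(features, mask):
    """Average the feature vectors at positions where mask is 1.

    features: H x W grid of C-dimensional vectors (nested lists).
    mask:     H x W grid of 0/1 labels.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        raise ValueError("mask selects no pixels")
    return [t / count for t in total]


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two vectors, with eps to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def segment_query(query_features, fg_proto, bg_proto):
    """Label a query location 1 if it is more similar to the foreground prototype."""
    return [
        [
            1 if cosine_similarity(v, fg_proto) > cosine_similarity(v, bg_proto) else 0
            for v in row
        ]
        for row in query_features
    ]


# Toy 2 x 2 support feature map with 2 channels (hypothetical values).
support_features = [[[1.0, 0.0], [0.9, 0.1]],
                    [[0.0, 1.0], [0.1, 0.9]]]
support_mask = [[1, 1],
                [0, 0]]
background_mask = [[1 - m for m in row] for row in support_mask]

# One foreground and one background prototype from the single support image.
fg_proto = masked_average_pooling(support_features, support_mask)
bg_proto = masked_average_pooling(support_features, background_mask)

# Toy query feature map, segmented by nearest prototype.
query_features = [[[0.8, 0.2], [0.2, 0.8]],
                  [[0.1, 1.0], [1.0, 0.0]]]
prediction = segment_query(query_features, fg_proto, bg_proto)
print(prediction)
```

Because this baseline builds each prototype from a single support mask, any mismatch between support and query appearance carries straight into the prediction. That is one plausible reading of the prototype bias the abstract refers to, and it is the kind of limitation that refining prototypes with query-side and text-side information is meant to address.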