August 17, 2025 | Open Access

Referring Semantic Segmentation With Implicit Patch Aligned Distillation Learning

Key Points

RiPAD achieves performance comparable to existing methods on three public datasets while balancing accuracy against training efficiency.
Aligning patch tokens with region-guided features improves the model's ability to handle open-vocabulary semantic segmentation.
The method builds on CLIP's contrastively pretrained encoders and needs no mask annotations for training.
Distillation associates region-based object-ness features with the vision encoder's patch tokens, which are then aligned with text embeddings.

Abstract

The CLIP model has demonstrated impressive zero-shot open-vocabulary classification capabilities. Several strategies have been proposed for open-vocabulary segmentation without mask annotations, using either explicit patch-token alignment or a grouping method with large-scale contrastive learning. To balance performance and training efficiency, we propose a referring semantic segmentation model with implicit patch-aligned distillation learning (RiPAD). RiPAD first associates the object-ness features from a region-based method with the patch tokens from the vision encoder. The resulting region-guided patch tokens are then aligned, by distillation, with the text embeddings from the text encoder. RiPAD achieves comparable performance on three public datasets and establishes a new baseline for CLIP-based open-vocabulary semantic segmentation without mask annotations.
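
The alignment step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. The abstract does not state the loss or the architecture, so the KL-divergence distillation loss, the temperature value, the 2-D toy vectors, and every function name below are assumptions chosen for illustration only. The sketch scores each patch token against each text embedding by cosine similarity, treats region-guided features as a teacher signal, and labels each patch with its best-matching text prompt.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def patch_text_probs(tokens, text_embeddings, temperature=0.1):
    """For each token, a probability distribution over the text prompts."""
    return [
        softmax([cosine(tok, txt) for txt in text_embeddings], temperature)
        for tok in tokens
    ]

def kl_distillation_loss(teacher_probs, student_probs, eps=1e-12):
    """Mean KL(teacher || student) over patches.

    Assumed loss for illustration; the abstract does not specify one.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_probs, student_probs):
        total += sum(
            t * math.log((t + eps) / (s + eps)) for t, s in zip(t_row, s_row)
        )
    return total / len(teacher_probs)

def segment(probs):
    """Assign each patch the index of its highest-probability text prompt."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]

# Hypothetical toy data: two text prompts and three patches in a 2-D space.
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
patch_tokens = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]      # student side
region_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # teacher side (assumed)

student = patch_text_probs(patch_tokens, text_embeddings)
teacher = patch_text_probs(region_features, text_embeddings)

print(segment(student))                        # [0, 1, 0]
print(kl_distillation_loss(teacher, student))  # small positive value
```

In a real model the vectors would come from the CLIP vision and text encoders, and the loss would be minimized by gradient descent over the patch-token projection. Here the values are fixed so that the data flow is easy to follow.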
