ABSTRACT The CLIP model has demonstrated impressive zero‐shot open‐vocabulary classification capabilities. Several strategies have been proposed for open‐vocabulary segmentation without mask annotations, relying either on explicit patch‐token alignment or on grouping methods trained with large‐scale contrastive learning. To balance performance and training efficiency, we propose a referring semantic segmentation model with implicit patch‐aligned distillation learning (RiPAD). RiPAD associates the objectness features from a region‐based method with the patch tokens from the vision encoder; the region‐guided patch tokens are then aligned with the text embeddings from the text encoder by distillation. RiPAD achieves comparable performance on three public datasets and establishes a new baseline for CLIP‐based open‐vocabulary semantic segmentation without mask annotations.
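To make the patch-to-text alignment idea concrete, the following is a minimal sketch of how patch tokens could be matched against class text embeddings at inference time. It is not the RiPAD implementation: the abstract does not specify the distillation objective or the region guidance, so both are omitted. The function names (`l2_normalize`, `assign_patches_to_classes`), the toy embeddings, and the use of cosine similarity with a per-patch argmax are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def assign_patches_to_classes(patch_tokens, text_embeddings, grid_hw):
    """Label each patch with its most similar class text embedding.

    patch_tokens:    (N, D) array, one embedding per image patch (hypothetical input).
    text_embeddings: (C, D) array, one embedding per class prompt (hypothetical input).
    grid_hw:         (H, W) patch grid with H * W == N.
    """
    patches = l2_normalize(patch_tokens)
    texts = l2_normalize(text_embeddings)
    similarity = patches @ texts.T       # (N, C) cosine similarities
    labels = similarity.argmax(axis=1)   # best-matching class index per patch
    height, width = grid_hw
    return labels.reshape(height, width), similarity


# Toy example: 4 patches on a 2x2 grid, 2 classes, 3-dimensional embeddings.
text = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0]])
patches = np.array([[0.9, 0.1, 0.0],
                    [0.2, 0.8, 0.0],
                    [0.1, 0.7, 0.1],
                    [0.8, 0.0, 0.2]])
seg_map, sim = assign_patches_to_classes(patches, text, (2, 2))
print(seg_map)
```

In a real pipeline the patch tokens and text embeddings would come from the CLIP vision and text encoders, and the coarse patch-level map would typically be upsampled to the image resolution.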
Liu et al. (Wed,) studied this question.