Producing quality segmentation masks for images is a fundamental problem in computer vision. Recent research has explored large-scale supervised training to enable zero-shot transfer segmentation on virtually any image style and unsupervised training to enable segmentation without dense annotations. However, constructing a model capable of segmenting anything in a zero-shot manner without any annotations is still challenging. In this paper, we propose to utilize the self-attention layers in stable diffusion models to achieve this goal because the pre-trained stable diffusion model has learned inherent concepts of objects within its attention layers. Specifically, we introduce a simple yet effective iterative merging process based on measuring KL divergence among attention maps to merge them into valid segmentation masks. The proposed method does not require any training or language dependency to extract quality segmentation for any image. On COCO-Stuff-27, our method surpasses the prior unsupervised zero-shot transfer SOTA method by an absolute 26% in pixel accuracy and 17% in mean IoU. The project page is at https://sites.google.com/view/diffseg/home.
1 Georgia Institute of Technology
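To make the idea of merging attention maps by KL divergence concrete, the following is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not specify the procedure, so the greedy anchor-based grouping, the symmetric form of the divergence, the `threshold` parameter, and all function names (`merge_maps`, `to_mask`, and so on) are assumptions chosen for illustration. Each attention map is treated as a flat list of non-negative values over pixels and normalized to a probability distribution.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def sym_kl(p, q):
    """Symmetric KL divergence (an assumed choice of distance)."""
    return 0.5 * (kl(p, q) + kl(q, p))


def normalize(m):
    """Scale a non-negative map so that it sums to 1."""
    total = sum(m)
    return [v / total for v in m]


def merge_maps(maps, threshold):
    """Greedily merge maps whose symmetric KL to an anchor is below threshold.

    Hypothetical procedure: take the first remaining map as the anchor,
    average it with every remaining map that is close to it, and repeat
    on whatever is left until no maps remain.
    """
    remaining = [normalize(m) for m in maps]
    merged = []
    while remaining:
        anchor = remaining.pop(0)
        group, rest = [anchor], []
        for m in remaining:
            (group if sym_kl(anchor, m) < threshold else rest).append(m)
        merged.append([sum(col) / len(group) for col in zip(*group)])
        remaining = rest
    return merged


def to_mask(merged):
    """Label each pixel with the index of the merged map that is largest there."""
    n_pixels = len(merged[0])
    return [max(range(len(merged)), key=lambda k: merged[k][i])
            for i in range(n_pixels)]


# Toy input: three 4-pixel "attention maps"; the first two attend to the
# same region, the third to a different one.
toy_maps = [
    [0.9, 0.8, 0.1, 0.1],
    [0.8, 0.9, 0.1, 0.1],
    [0.1, 0.1, 0.9, 0.8],
]
toy_merged = merge_maps(toy_maps, threshold=0.1)
toy_mask = to_mask(toy_merged)
```

In this toy setting the first two maps fall within the threshold of each other and are averaged into one proposal, while the third stays separate, so each pixel ends up labeled by whichever of the two proposals dominates it. In a real pipeline the maps would come from the diffusion model's self-attention layers, which this sketch does not attempt to reproduce.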
Tian et al. studied this question.