With the growing popularity of artificial intelligence models and rising expectations for artificial intelligence applications in many fields, referring image segmentation (RIS) has attracted considerable attention from researchers. RIS is one of the most fundamental and challenging vision-language cross-modal tasks at the intersection of computer vision and natural language processing; it aims to segment the instance in an image that corresponds to a given natural language expression. This paper aims to provide as comprehensive an overview as possible, covering the mainstream benchmark datasets and their statistics, common evaluation metrics, several crucial and representative works in RIS, and the performance of each method. The included RIS methods are described in terms of their core model structure and the procedure they follow to perform RIS, and are grouped into five classes according to how multimodal information is processed. At the end of the paper, the author briefly discusses possible future directions for RIS research.
Honglin Wang studied this question.