Referring remote sensing image segmentation (RRSIS) aims to accurately segment target objects in remote sensing images based on natural language instructions. Despite its growing relevance, progress in this field is constrained by limited datasets and weak cross-modal alignment. To support RRSIS research, we construct referring image segmentation in optical remote sensing (RISORS), a large-scale benchmark containing 36,697 instruction–mask pairs. RISORS provides diverse, high-quality samples that enable comprehensive experimentation in remote sensing contexts. Building on this foundation, we propose Referring-SAM (RSAM), a novel framework that extends Segment Anything Model 2 to support text-prompted segmentation. RSAM integrates a Two-Way Guidance Module (TWGM) and a Multimodal Mask Decoder (MMMD). TWGM implements a two-way guidance mechanism in which image and text features refine each other, with positional encodings incorporated across all attention layers to significantly enhance relational reasoning. MMMD effectively separates textual prompts from spatial prompts, improving segmentation accuracy in complex multimodal settings. Extensive experiments on RISORS, as well as on the RefSegRS and RRSIS-D datasets, demonstrate that RSAM achieves state-of-the-art performance, particularly in segmenting small and diverse targets. Ablation studies further validate the individual contributions of TWGM and MMMD. This work provides a solid foundation for further developments in integrated vision-language analysis within remote sensing applications.
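The two-way guidance idea described above can be illustrated with a minimal NumPy sketch. This is not the RSAM implementation: the function names (`cross_attend`, `two_way_guidance`), the single-head attention without learned projections, the update order, the residual connections, and the token and feature sizes are all assumptions chosen for illustration. The only properties taken from the description are that image and text features refine each other and that positional encodings are re-applied in every attention layer.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context, q_pos, c_pos):
    """Single-head cross-attention (hypothetical helper).

    Positional encodings are added to queries and keys only, so the
    values stay free of positional information.
    """
    q = queries + q_pos
    k = context + c_pos
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ context

def two_way_guidance(img, txt, img_pos, txt_pos, num_layers=2):
    """Alternate text<-image and image<-text updates (assumed ordering)."""
    for _ in range(num_layers):
        # Text tokens are refined by attending to image tokens.
        txt = txt + cross_attend(txt, img, txt_pos, img_pos)
        # Image tokens are refined by attending to the updated text tokens.
        img = img + cross_attend(img, txt, img_pos, txt_pos)
    return img, txt

rng = np.random.default_rng(0)
img = rng.normal(size=(16, 8))      # 16 image tokens, feature dim 8 (illustrative)
txt = rng.normal(size=(5, 8))       # 5 text tokens, feature dim 8 (illustrative)
img_pos = rng.normal(size=(16, 8))  # stand-in positional encodings
txt_pos = rng.normal(size=(5, 8))

img_out, txt_out = two_way_guidance(img, txt, img_pos, txt_pos)
print(img_out.shape, txt_out.shape)
```

Each modality keeps its own token count and feature dimension, so the refined image tokens could still be passed to a downstream mask decoder of the same input shape.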
Zhao et al. studied this question.