December 8, 2025 · Open Access

RSAM: Vision-Language Two-Way Guidance for Referring Remote Sensing Image Segmentation

Key Points

RSAM achieves state-of-the-art performance, particularly in segmenting small and diverse targets.
Experiments on the RISORS, RefSegRS, and RRSIS-D datasets confirm the effectiveness of RSAM in remote sensing applications.
The framework integrates a Two-Way Guidance Module (TWGM), which mutually refines image and text features, and a Multimodal Mask Decoder (MMMD), which separates textual prompts from spatial prompts.
This research lays a solid foundation for further work on integrated vision-language analysis in remote sensing.

Abstract

Referring remote sensing image segmentation (RRSIS) aims to accurately segment target objects in remote sensing images based on natural language instructions. Despite its growing relevance, progress in this field is constrained by limited datasets and weak cross-modal alignment. To support RRSIS research, we construct referring image segmentation in optical remote sensing (RISORS), a large-scale benchmark containing 36,697 instruction–mask pairs. RISORS provides diverse and high-quality samples that enable comprehensive experimentation in remote sensing contexts. Building on this foundation, we propose Referring-SAM (RSAM), a novel framework that extends Segment Anything Model 2 to support text-prompted segmentation. RSAM integrates a Two-Way Guidance Module (TWGM) and a Multimodal Mask Decoder (MMMD). TWGM facilitates a two-way guidance mechanism that mutually refines image and text features, with positional encodings incorporated across all attention layers to significantly enhance relational reasoning. MMMD effectively separates textual prompts from spatial prompts, improving segmentation accuracy in complex multimodal settings. Extensive experiments on RISORS, as well as RefSegRS and RRSIS-D datasets, demonstrate that RSAM achieves state-of-the-art performance, particularly in segmenting small and diverse targets. Ablation studies further validate the individual contributions of TWGM and MMMD. This work provides a solid foundation for further developments in integrated vision-language analysis within remote sensing applications.
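
The two-way guidance idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' TWGM implementation: the function names, the single-head attention without learned projections, the residual updates, and the random toy features are all assumptions made for illustration. It only shows the general pattern of text tokens attending to image tokens and image tokens attending back to the refined text tokens, with positional encodings added to the queries and keys of each attention step.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Single-head scaled dot-product attention.
    # Learned projections are omitted (an assumption for brevity).
    scale = np.sqrt(queries.shape[-1])
    weights = softmax(queries @ keys.T / scale, axis=-1)
    return weights @ values

def two_way_guidance(image_feats, text_feats, image_pos, text_pos):
    # Hypothetical two-way block: positional encodings are added to the
    # queries and keys of every attention step, not only at the input.
    # Step 1: text tokens gather evidence from image tokens.
    text_out = text_feats + cross_attention(
        text_feats + text_pos, image_feats + image_pos, image_feats
    )
    # Step 2: image tokens are refined by the updated text tokens.
    image_out = image_feats + cross_attention(
        image_feats + image_pos, text_out + text_pos, text_out
    )
    return image_out, text_out

# Toy inputs: 16 image tokens (e.g. a 4x4 feature grid) and 5 text tokens,
# all with feature dimension 8. Values are random placeholders.
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(16, 8))
text_feats = rng.normal(size=(5, 8))
image_pos = rng.normal(size=(16, 8))
text_pos = rng.normal(size=(5, 8))

image_out, text_out = two_way_guidance(image_feats, text_feats, image_pos, text_pos)
print(image_out.shape, text_out.shape)
```

In a real model each attention step would use learned query, key, and value projections, multiple heads, and normalization layers; the sketch keeps only the direction of information flow.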


Cite This Study

Zhao et al. studied this question.

synapsesocial.com/papers/69401f062d562116f28f9edd
https://doi.org/10.3390/rs17243960
