March 3, 2026 | Open Access

AerOSeg++: Scale-Aware and Texture-Guided Open-Vocabulary Segmentation with SAM Features for Remote Sensing Images

Key Points

AerOSeg++ improves open-vocabulary segmentation of remote sensing images by explicitly addressing scale variation.
The model shows significant performance gains over state-of-the-art open-vocabulary methods on three remote sensing benchmark datasets: iSAID, DLRSD, and OpenEarthMap.
The approach combines scale-invariant feature learning with a multi-scale decoder framework and fine-grained texture features.
The work highlights the need for open-vocabulary segmentation methods that accommodate the diverse complexities of geospatial imagery.

Abstract

Remote sensing image segmentation poses significant challenges in generalizing to categories that are unseen during training but appear at evaluation time. Existing open-vocabulary segmentation methods, primarily designed for natural images, struggle to cope with the spatial complexity, scale variation, and high-resolution characteristics of remote sensing imagery. Specifically, scale variations during inference can degrade performance, as the model tends to overfit to fixed-scale patterns encountered during training. This also affects the model's ability to recognize unseen or novel class objects appearing in varying sizes or resolutions during testing. These limitations motivate open-vocabulary segmentation methods that address the specific challenges of geospatial images. In this work, we introduce AerOSeg++, an open-vocabulary segmentation method for remote sensing that focuses on scale-invariant feature learning. We first compute robust image-text correlation features using rotated input images and domain-specific prompts. These are refined via spatial and class refinement blocks, guided by SAM features to enhance spatial consistency. To upscale the refined correlation features, we propose a multi-scale decoder framework that fuses fine-grained texture features with SAM-derived features. By leveraging texture information across multiple receptive fields, AerOSeg++ effectively captures scale-consistent patterns, facilitating accurate segmentation of objects across varying spatial resolutions. Additionally, our training pipeline incorporates ScaleDrop, a computationally efficient, parameter-free feature rescaling module that encourages scale-invariant feature representation learning. Our proposed model shows significant performance gains over state-of-the-art open-vocabulary methods when evaluated on three benchmark datasets for remote sensing: iSAID, DLRSD, and OpenEarthMap. These results highlight the effectiveness of our scale-invariant design and texture-guided multi-scale feature upsampling in handling the challenges of open-vocabulary segmentation in remote sensing imagery.
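As an illustration of the first step described in the abstract, the sketch below shows one plausible way to build image-text correlation features from rotated inputs. It is not the authors' implementation. It assumes that the correlation is a per-pixel cosine similarity between image embeddings and class text embeddings, that rotations are multiples of 90 degrees, and that the per-rotation maps are rotated back and averaged. The names `cosine_correlation`, `rotation_averaged_correlation`, and `encode_image` are hypothetical, and `text_emb` stands in for embeddings of domain-specific prompts produced by some text encoder.

```python
import numpy as np


def cosine_correlation(pixel_emb, text_emb, eps=1e-8):
    """Per-pixel cosine similarity between image and text embeddings.

    pixel_emb: (H, W, D) array of per-pixel image embeddings.
    text_emb:  (C, D) array with one embedding per class prompt.
    Returns an (H, W, C) correlation volume.
    """
    p = pixel_emb / (np.linalg.norm(pixel_emb, axis=-1, keepdims=True) + eps)
    t = text_emb / (np.linalg.norm(text_emb, axis=-1, keepdims=True) + eps)
    return p @ t.T


def rotation_averaged_correlation(image, text_emb, encode_image, ks=(0, 1, 2, 3)):
    """Average correlation volumes over 90-degree rotations of the input.

    encode_image is a hypothetical callable mapping an (H, W, ...) image
    to (H, W, D) per-pixel embeddings. Averaging over rotations is an
    assumption made for this sketch, not a detail stated in the abstract.
    """
    maps = []
    for k in ks:
        rotated = np.rot90(image, k=k, axes=(0, 1))        # rotate the input
        corr = cosine_correlation(encode_image(rotated), text_emb)
        maps.append(np.rot90(corr, k=-k, axes=(0, 1)))     # undo the rotation
    return np.mean(maps, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((4, 6, 3))      # toy "image" with 3 channels
    text_emb = rng.random((5, 3))      # 5 toy class embeddings
    identity_encoder = lambda x: x     # stand-in for a real image encoder
    corr = rotation_averaged_correlation(image, text_emb, identity_encoder)
    print(corr.shape)
```

With a purely pointwise encoder such as the identity stand-in above, the averaged result equals the unrotated correlation, which makes a convenient sanity check. A real encoder with spatial context would generally produce different maps per rotation, which is where averaging could add robustness.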
