Remote sensing image segmentation poses significant challenges in generalizing to unseen categories during the evaluation phase. Existing open-vocabulary segmentation methods, primarily designed for natural images, struggle to cope with the spatial complexity, scale variation, and high-resolution characteristics of remote sensing imagery. Specifically, scale variations during inference can degrade performance, as the model tends to overfit to the fixed-scale patterns encountered during training. This also impairs the model's ability to recognize unseen or novel class objects that appear at varying sizes or resolutions during testing. These limitations motivate open-vocabulary segmentation methods that address the specific challenges of geospatial images. In this work, we introduce AerOSeg++, an open-vocabulary segmentation method for remote sensing that focuses on scale-invariant feature learning. We first compute robust image-text correlation features using rotated input images and domain-specific prompts. These are refined via spatial and class refinement blocks, guided by SAM features to enhance spatial consistency. To upscale the refined correlation features, we propose a multi-scale decoder framework that fuses fine-grained texture features with SAM-derived features. By leveraging texture information across multiple receptive fields, AerOSeg++ effectively captures scale-consistent patterns, facilitating accurate segmentation of objects across varying spatial resolutions. Additionally, our training pipeline incorporates ScaleDrop, a computationally efficient, parameter-free feature rescaling module that encourages scale-invariant feature representation learning. Our model achieves significant performance gains over state-of-the-art open-vocabulary methods on three remote sensing benchmark datasets: iSAID, DLRSD, and OpenEarthMap.
These results highlight the effectiveness of our scale-invariant design and texture-guided multi-scale feature upsampling in handling the challenges of open-vocabulary segmentation in remote sensing imagery.
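As an illustration of the first step (image-text correlation features computed from rotated inputs), the following is a minimal sketch and not the AerOSeg++ implementation. The abstract does not specify the rotation angles, the encoders, or how the rotated views are combined, so the sketch assumes rotations by multiples of 90 degrees, cosine similarity as the correlation, and a plain average over views. The names `rot90`, `cosine`, `rotation_averaged_correlation`, and the `encode` callable are hypothetical.

```python
import math

def rot90(grid, k):
    """Rotate an H x W grid (list of rows) counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counter-clockwise turn.
        grid = [list(row) for row in zip(*grid)][::-1]
    return grid

def cosine(u, v, eps=1e-12):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def rotation_averaged_correlation(image, text_embs, encode, ks=(0, 1, 2, 3)):
    """Average per-pixel image-text cosine maps over rotated views.

    image:     H x W grid of pixel values
    text_embs: list of C class-prompt embeddings (assumed precomputed)
    encode:    hypothetical dense encoder, grid of pixels -> grid of embeddings
    Returns an H x W x C nested list of correlations.
    """
    h, w, c = len(image), len(image[0]), len(text_embs)
    acc = [[[0.0] * c for _ in range(w)] for _ in range(h)]
    for k in ks:
        feats = encode(rot90(image, k))            # encode the rotated view
        corr = [[[cosine(f, t) for t in text_embs] for f in row] for row in feats]
        corr = rot90(corr, -k)                     # align back to the input frame
        for i in range(h):
            for j in range(w):
                for n in range(c):
                    acc[i][j][n] += corr[i][j][n] / len(ks)
    return acc

# Toy dense encoder (an assumption for illustration): each pixel is embedded
# together with its right-hand neighbour, so the encoder is orientation-sensitive.
def toy_encode(img):
    return [[[row[j], row[(j + 1) % len(row)], 1.0] for j in range(len(row))]
            for row in img]

image = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
text_embs = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
corr_map = rotation_averaged_correlation(image, text_embs, toy_encode)
```

Because each view's correlation map is rotated back before averaging, the result stays aligned with the input pixel grid even for non-square images. In a real pipeline, `encode` and `text_embs` would come from a vision-language model, a detail the abstract does not fix.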
Dutta et al. (Wed,) studied this question.