Building extraction from high-resolution remote sensing imagery is valuable for urban planning, disaster assessment, and geospatial analysis. However, current semantic segmentation models remain limited in complex scenes with diverse building morphologies, large scale variations, and blurred boundaries. To address insufficient long-range dependency modeling, suboptimal multi-scale feature representation, and weak spatial adaptability, this paper proposes a building extraction network that integrates multi-scale sequence modeling with spatially adaptive enhancement. Using UPerNet with a ConvNeXt-Tiny backbone as the baseline, the proposed method introduces a dedicated PyramidSSM-based neck (PyramidSSMNeck) as the primary design for multi-scale feature alignment and fusion, and integrates three enhancement components (the SSM-based S6, LSKNet, and SAFM) whose additional gains appear mainly in boundary delineation. Specifically, PyramidSSMNeck performs structured cross-scale feature projection, alignment, and aggregation to strengthen multi-scale representation; S6 enhances long-range contextual modeling; LSKNet adaptively adjusts spatial receptive fields to accommodate scale variations; and SAFM modulates feature responses with spatial cues to refine boundaries and fine details. Together, these components form a unified framework.
Experiments were conducted on the WHU Building dataset, the INRIA dataset, and a self-constructed Ganzhou urban dataset. The proposed method achieved IoU scores of 91.29%, 81.96%, and 88.18% on the three datasets, outperforming the baseline UPerNet (ConvNeXt-Tiny) by 2.37, 0.88, and 3.68 percentage points, respectively, with F1-scores consistently exceeding 90%. Importantly, ablation results indicate that most of the region-level gains (IoU/F1) come from PyramidSSMNeck, whereas the additional modules contribute mainly to boundary quality: Boundary IoU rises from 63.29% to 65.63% (+2.34 percentage points) from the neck-only setting to the full model. Visualization results further support the method’s advantages in boundary preservation and detail integrity, and additional cross-domain transfer experiments (zero-shot and few-shot from WHU to Ganzhou) suggest improved robustness under domain shift.
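For readers less familiar with the region-level metrics reported above, the following minimal sketch shows how IoU and F1 can be computed for the building class from a predicted and a reference binary mask. The function name `iou_and_f1`, the nested-list mask format, and the handling of two empty masks are illustrative assumptions, not the paper's evaluation code.

```python
def iou_and_f1(pred, target):
    """Return (IoU, F1) for two equally sized binary masks.

    Masks are nested lists of 0/1 values (hypothetical format chosen for
    illustration); 1 marks a building pixel.
    """
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1      # building predicted and present
            elif p and not t:
                fp += 1      # building predicted but absent
            elif t and not p:
                fn += 1      # building present but missed
    union = tp + fp + fn
    if union == 0:
        # Assumption: two empty masks are treated as perfect agreement.
        return 1.0, 1.0
    iou = tp / union
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1


# Toy 2x3 example: 2 true positives, 1 false positive, 1 false negative.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
iou, f1 = iou_and_f1(pred, target)
print(f"IoU={iou:.4f}, F1={f1:.4f}")
```

When both metrics are computed from the same pixel counts, they are linked by F1 = 2·IoU / (1 + IoU). Under that assumption, an IoU of 81.96% corresponds to an F1 of roughly 90.1%, which is consistent with the statement that F1-scores exceed 90% on all three datasets. Whether the paper aggregates counts per image or over the whole dataset is not stated here, so this relation should be read as a sanity check rather than a reproduction of the reported values.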
Zuo et al. (Thu,) studied this question.