What question did this study set out to answer?

February 28, 2026 | Open Access

A High-Resolution Remote Sensing Building Extraction Network Integrating Multi-Scale Sequence Modeling and Spatial Adaptive Enhancement

Key Points

The research aims to improve building extraction from high-resolution remote sensing imagery by integrating multi-scale sequence modeling with spatial adaptive enhancement.
Utilized UPerNet with ConvNeXt-Tiny as the baseline framework
Developed PyramidSSMNeck for multi-scale feature alignment and fusion
Integrated three enhancement components: S6, LSKNet, and SAFM
Conducted experiments on WHU Building, INRIA, and Ganzhou urban datasets
Evaluated performance using Intersection over Union (IoU) and F1-scores
Achieved IoU scores of 91.29%, 81.96%, and 88.18% across the three datasets
Outperformed baseline UPerNet by 2.37%, 0.88%, and 3.68% in IoU scores
F1-scores consistently exceeded 90%
Increased Boundary IoU from 63.29% (neck-only setting) to 65.63% (full model), a gain of 2.34 points
Demonstrated improved robustness under domain shifts in cross-domain transfer experiments
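
The two region-level metrics named above can be illustrated with a minimal sketch. It assumes binary masks stored as nested lists (1 = building, 0 = background). The function names are hypothetical and are not taken from the paper's code, and the paper may aggregate counts differently (for example over a whole dataset rather than per image).

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]  # assumed layout: rows of 0/1 pixel labels


def confusion_counts(pred: Mask, target: Mask) -> Tuple[int, int, int]:
    """Count true positives, false positives, and false negatives."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    return tp, fp, fn


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over Union: TP / (TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, target)
    denom = tp + fp + fn
    # Convention assumed here: two empty masks count as a perfect match.
    return tp / denom if denom else 1.0


def f1_score(pred: Mask, target: Mask) -> float:
    """F1 (Dice) score: 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, target)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


if __name__ == "__main__":
    pred = [[1, 1, 0], [0, 1, 0]]
    target = [[1, 0, 0], [0, 1, 1]]
    print(f"IoU = {iou(pred, target):.4f}, F1 = {f1_score(pred, target):.4f}")
```

When both metrics are computed from the same confusion counts, they are linked by F1 = 2·IoU / (1 + IoU), so F1 is never lower than IoU for the same prediction.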

Abstract

Building extraction from high-resolution remote sensing imagery holds significant value for urban planning, disaster assessment, and geospatial analysis. However, current semantic segmentation models still face limitations in complex scenes with diverse building morphologies, large scale variations, and blurred boundaries. To address insufficient long-range dependency modeling, suboptimal multi-scale feature representation, and weak spatial adaptability, this paper proposes a building extraction network that integrates multi-scale sequence modeling with spatial adaptive enhancement. Adopting UPerNet (equipped with ConvNeXt-Tiny) as the baseline framework, the proposed method introduces a dedicated PyramidSSM-based neck (PyramidSSMNeck) as the primary design for multi-scale feature alignment and fusion, and further integrates three enhancement components: S6 (SSM-based), LSKNet, and SAFM. Specifically, PyramidSSMNeck performs structured cross-scale feature projection, alignment, and aggregation to strengthen multi-scale representation. S6 enhances long-range contextual modeling, LSKNet adaptively adjusts spatial receptive fields to accommodate scale variations, and SAFM modulates feature responses with spatial cues to refine boundaries and fine details. Together they form a unified framework in which PyramidSSMNeck drives multi-scale alignment and fusion, while the three enhancement components mainly benefit boundary preservation and fine-detail integrity.
Experiments were conducted on the WHU Building and INRIA datasets and on a self-constructed Ganzhou urban dataset. The proposed method achieved IoU scores of 91.29%, 81.96%, and 88.18% on the three datasets, outperforming the baseline UPerNet (ConvNeXt-Tiny) by 2.37%, 0.88%, and 3.68%, respectively, with F1-scores consistently exceeding 90%. Importantly, ablation results indicate that the majority of region-level gains (IoU/F1) come from PyramidSSMNeck, whereas the additional modules contribute more prominently to boundary quality, raising Boundary IoU from 63.29% in the neck-only setting to 65.63% in the full model (+2.34 points). Visualization results further support the method’s advantages in boundary preservation and detail integrity, and additional cross-domain transfer experiments (zero-shot and few-shot from WHU to Ganzhou) suggest improved robustness under domain shift.
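
The abstract does not state how Boundary IoU is computed. A common formulation restricts each mask to a thin band along its contour (the mask minus an eroded copy of itself) and takes the IoU of the two bands. The sketch below follows that formulation under stated assumptions: masks are nested lists of 0/1, erosion uses a 3x3 neighbourhood with pixels outside the image treated as background, and the band width is given in pixels. All function names and the `width` parameter are hypothetical.

```python
from typing import List, Sequence

Mask = Sequence[Sequence[int]]  # assumed layout: rows of 0/1 pixel labels


def erode(mask: Mask, iterations: int = 1) -> List[List[int]]:
    """Binary erosion with a 3x3 window; out-of-image pixels count as background."""
    height, width = len(mask), len(mask[0])
    current = [list(row) for row in mask]
    for _ in range(iterations):
        eroded = [[0] * width for _ in range(height)]
        for y in range(height):
            for x in range(width):
                eroded[y][x] = int(all(
                    0 <= y + dy < height
                    and 0 <= x + dx < width
                    and current[y + dy][x + dx]
                    for dy in (-1, 0, 1)
                    for dx in (-1, 0, 1)
                ))
        current = eroded
    return current


def boundary_band(mask: Mask, width: int = 1) -> List[List[int]]:
    """Foreground pixels removed by `width` erosion steps, i.e. the contour band."""
    eroded = erode(mask, width)
    return [
        [int(bool(m) and not e) for m, e in zip(mask_row, eroded_row)]
        for mask_row, eroded_row in zip(mask, eroded)
    ]


def boundary_iou(pred: Mask, target: Mask, width: int = 1) -> float:
    """IoU restricted to the contour bands of the two masks."""
    pred_band = boundary_band(pred, width)
    target_band = boundary_band(target, width)
    intersection = union = 0
    for pred_row, target_row in zip(pred_band, target_band):
        for p, t in zip(pred_row, target_row):
            intersection += int(p and t)
            union += int(p or t)
    # Convention assumed here: two empty bands count as a perfect match.
    return intersection / union if union else 1.0


if __name__ == "__main__":
    # A 3x3 building and the same building predicted one pixel to the right.
    target = [[0, 0, 0, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 1, 1, 1, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 0, 0, 0]]
    pred = [[0, 0, 0, 0, 0],
            [0, 0, 1, 1, 1],
            [0, 0, 1, 1, 1],
            [0, 0, 1, 1, 1],
            [0, 0, 0, 0, 0]]
    print(f"Boundary IoU = {boundary_iou(pred, target):.4f}")
```

In this toy case the one-pixel shift leaves a region IoU of 0.5 but a Boundary IoU of only 1/3, which illustrates why a boundary-focused metric can separate models whose region-level scores are close.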
