Abstract

Accurate extraction of building information from remote sensing imagery is essential for urban planning and management, yet it remains challenging in mountainous regions due to complex terrain, fragmented settlements, and limited annotated data. Existing methods often require extensive manual labeling or struggle to distinguish buildings from vegetation, shadows, and bare land. To address these issues, we propose a framework that leverages multi-spectral and terrain information to automatically generate coarse-grained building masks and corresponding point prompts, which are then used to fine-tune the Segment Anything Model (SAM) originally trained on millions of natural images. This approach enables accurate extraction of urban buildings in mountainous areas of China with minimal manual annotation. On the test dataset from the same region, our method achieves an F1-score of 82.46% and an IoU of 70.15%, outperforming the original SAM and EfficientSAM by more than 25 and 30 percentage points, respectively, and surpassing FCN, UNet, Swin Transformer, and DeepLabV3+ by up to 36 and 41 percentage points. On validation datasets from other regions, the method maintains robust performance with F1-scores above 70% and IoU around 60%, consistently higher than competing baselines. The framework is efficient, easy to deploy, and provides a significant step toward practical large-scale building extraction in complex terrains.
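The F1-score and IoU quoted above can be illustrated with a minimal sketch. The abstract does not say how the metrics are aggregated, so this assumes both are computed pixel-wise for the building class from a single pooled confusion matrix. The function names (`confusion_counts`, `f1_score`, `iou`, `f1_from_iou`) and the toy masks are hypothetical and are not taken from the paper. Under that assumption the two metrics are tied by F1 = 2·IoU / (1 + IoU), and the reported pair (70.15% IoU, 82.46% F1) is consistent with this identity.

```python
from typing import Sequence, Tuple

# 2-D binary mask: 1 = building pixel, 0 = background (illustrative type alias)
Mask = Sequence[Sequence[int]]


def confusion_counts(pred: Mask, truth: Mask) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) pixel counts for the building class."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1      # predicted building, is building
            elif p:
                fp += 1      # predicted building, is background
            elif t:
                fn += 1      # missed building pixel
    return tp, fp, fn


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def iou(tp: int, fp: int, fn: int) -> float:
    """IoU = TP / (TP + FP + FN); returns 0.0 when undefined."""
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def f1_from_iou(iou_value: float) -> float:
    """Convert IoU to F1, valid when both come from the same confusion matrix."""
    return 2 * iou_value / (1 + iou_value)


if __name__ == "__main__":
    # Toy masks, invented purely for illustration.
    pred = [[1, 1, 0],
            [0, 1, 0]]
    truth = [[1, 0, 0],
             [0, 1, 1]]
    tp, fp, fn = confusion_counts(pred, truth)
    print(f"F1 = {f1_score(tp, fp, fn):.4f}, IoU = {iou(tp, fp, fn):.4f}")
    # Consistency check against the figures reported in the abstract.
    print(f"F1 implied by IoU of 70.15%: {100 * f1_from_iou(0.7015):.2f}%")
```

If the metrics were instead averaged per image or per tile, the identity would hold only approximately, so this check is a plausibility test and not a reproduction of the evaluation protocol.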
Su et al. (Thu,) studied this question.