With the rapid development of remote sensing technology, high-resolution remote sensing images have become valuable in natural resource planning, environmental monitoring, fire rescue, and other fields because of their rich spatial detail. However, such images contain diverse land cover types with complex spatial distributions. Traditional algorithms tend to lose detail during feature extraction and adapt poorly to complex scenes, so land cover classification accuracy and map feature extraction efficiency fall short of practical needs. This paper takes the Transformer model as its core and constructs a technical framework of "feature enhancement - multi-scale fusion - accurate classification and extraction". First, the positional encoding module of the Transformer is improved to suit the spatial characteristics of remote sensing images. Second, a multi-scale feature fusion unit is designed that combines the local feature extraction of CNNs with the global dependency modeling of the Transformer. Finally, an adaptive loss function is proposed to optimize model training. Experiments were conducted on the publicly available high-resolution remote sensing dataset WHU-SEN-City and a self-built UAV image dataset. The results show that the proposed algorithm, MSA-ST, has clear advantages in classifying and extracting multiple land cover types: in building classification, its boundary localization accuracy reaches 91.2%, 5.6 percentage points higher than the 85.6% achieved by Swin Transformer, which supports efficient land cover identification and feature extraction in complex scenes.
Yanan Liu (Thu,) studied this question.
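
The abstract does not say how the positional encoding is adapted to remote sensing images. As an illustration only, the sketch below shows one common way to give a Transformer a sense of two-dimensional layout: a standard 2D sinusoidal encoding that spends half of the channels on a patch's row index and half on its column index. This is not the MSA-ST encoding; the function names `sincos_1d`, `sincos_2d`, and `grid_encoding`, the frequency base of 10000, and the row/column channel split are all assumptions made for the example.

```python
import math


def sincos_1d(position: int, dim: int) -> list[float]:
    """Standard 1D sinusoidal encoding of one position into `dim` values.

    Assumption: base 10000, as in the original Transformer formulation.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    enc = []
    for i in range(dim // 2):
        # Each sin/cos pair uses a geometrically decreasing frequency.
        freq = 1.0 / (10000 ** (2 * i / dim))
        enc.append(math.sin(position * freq))
        enc.append(math.cos(position * freq))
    return enc


def sincos_2d(row: int, col: int, dim: int) -> list[float]:
    """Encode a patch's (row, col) grid position.

    Hypothetical design choice: the first half of the channels encodes
    the row index and the second half encodes the column index.
    """
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    half = dim // 2
    return sincos_1d(row, half) + sincos_1d(col, half)


def grid_encoding(rows: int, cols: int, dim: int) -> list[list[list[float]]]:
    """Build the encoding for every patch of a rows x cols patch grid."""
    return [[sincos_2d(r, c, dim) for c in range(cols)] for r in range(rows)]


if __name__ == "__main__":
    grid = grid_encoding(rows=2, cols=3, dim=8)
    print(len(grid), len(grid[0]), len(grid[0][0]))
```

Because rows and columns are encoded separately, two patches that share a row agree on the first half of their encoding, which is one simple way to expose image geometry to attention layers. An encoding tailored to remote sensing imagery could differ substantially from this.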