Semantic segmentation of remote sensing images is a crucial step in intelligent remote sensing interpretation. Most current approaches rely on attention mechanisms to enhance long-range representations. However, these works overlook the key problem of foreground-background imbalance, and their performance has hit a bottleneck. In this paper, we introduce mask classification into remote sensing image interpretation for the first time and propose a novel mixed-mask Transformer (MMT) for remote sensing image semantic segmentation. Specifically, we propose a mixed-mask attention mechanism, a simple but effective module that helps the network learn more explicit intraclass and interclass correlations by capturing long-range interdependent representations. In addition, we propose a progressive multi-scale learning strategy to handle the large scale variation of targets in remote sensing images; it integrates semantic and visual representations of targets at different scales by making efficient use of large-scale feature maps in the Transformer. Experimental results show that the proposed MMT outperforms existing alternative approaches and achieves state-of-the-art performance on three semantic segmentation datasets.
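To make the notion of mask classification concrete, the following is a minimal sketch of how a set of mask proposals, each with its own class distribution, can be turned into a per-pixel semantic map. It follows the general style of mask-classification inference used by models such as MaskFormer and is not taken from the MMT paper, whose abstract does not describe this step. The function name `semantic_map`, the input layout, and the trailing "no object" class are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def semantic_map(class_logits, mask_logits):
    """Combine N mask proposals into an H x W map of class indices.

    Assumed (hypothetical) layout:
      class_logits: N x (K + 1) nested list; last column is "no object".
      mask_logits:  N x H x W nested list of per-pixel mask logits.
    """
    num_classes = len(class_logits[0]) - 1
    # Per-proposal class probabilities, with the "no object" column dropped.
    probs = [softmax(row)[:num_classes] for row in class_logits]
    height, width = len(mask_logits[0]), len(mask_logits[0][0])

    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            # Score of class k at this pixel:
            # sum over proposals of p(class k) * p(pixel belongs to mask).
            scores = [
                sum(p[k] * sigmoid(m[y][x]) for p, m in zip(probs, mask_logits))
                for k in range(num_classes)
            ]
            row.append(max(range(num_classes), key=scores.__getitem__))
        labels.append(row)
    return labels


# Two proposals, two classes (plus "no object"), a 1 x 2 image.
class_logits = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
mask_logits = [[[5.0, -5.0]], [[-5.0, 5.0]]]
print(semantic_map(class_logits, mask_logits))
```

Because each proposal predicts a whole region together with a single class distribution, background pixels do not need to compete with foreground pixels one by one. This is one plausible reason the formulation is attractive when foreground and background are imbalanced, although the sketch itself does not demonstrate that effect.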
Xu et al. (Sun,) studied this question.