As an emerging learning paradigm, self-supervised learning (SSL) has attracted extensive attention for its ability to learn effective feature representations from massive unlabeled data. In particular, SSL driven by contrastive learning and masked modeling shows great potential in general visual tasks. However, because of the diversity of ground target types, the complexity of spectral radiation characteristics, and changing environmental conditions, existing SSL frameworks exhibit limited feature extraction accuracy and generalization ability when applied to complex remote sensing scenarios. To address this issue, we propose a hybrid SSL framework that combines the advantages of contrastive learning and masked modeling to extract more robust and reliable features from remote sensing images. The proposed framework consists of two parallel branches. One branch uses a contrastive learning strategy: it constructs positive and negative sample pairs to strengthen global feature representation and capture image structural information. The other branch adopts a masked modeling strategy: it focuses on the fine-grained analysis of local details and predicts the features of masked areas, thereby establishing connections between global and local features. Additionally, to better integrate local and global features, we adopt a hybrid CNN+Transformer architecture, which is particularly suitable for dense downstream tasks such as semantic segmentation. Extensive experimental results demonstrate that the proposed framework not only exhibits superior feature extraction ability and higher accuracy in small-sample scenarios but also outperforms state-of-the-art SSL frameworks on large-scale datasets.
Pang et al. studied this question.
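The two-branch objective described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss functions, so everything below is an assumption made for illustration: an InfoNCE-style loss stands in for the contrastive branch, a mean squared error over masked patches stands in for the masked modeling branch, and the two are combined with a hypothetical `weight`. The function names, the `temperature` value, and the weighting scheme are not taken from the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def info_nce(anchors, positives, temperature=0.1):
    """Contrastive branch (assumed InfoNCE form).

    anchors[i] and positives[i] are embeddings of two views of image i;
    every positives[j] with j != i acts as a negative for anchors[i].
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, pj) / temperature for pj in p]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def masked_mse(pred, target, mask):
    """Masked branch (assumed MSE form), averaged over masked patches only.

    mask[i] == 1 marks patch i as masked; visible patches are ignored.
    """
    errs = [
        sum((x - y) ** 2 for x, y in zip(pr, tg)) / len(tg)
        for pr, tg, m in zip(pred, target, mask)
        if m
    ]
    return sum(errs) / len(errs) if errs else 0.0


def hybrid_loss(anchors, positives, pred, target, mask,
                weight=1.0, temperature=0.1):
    """Sum of the two branch losses; `weight` is a hypothetical balance term."""
    contrastive = info_nce(anchors, positives, temperature)
    reconstruction = masked_mse(pred, target, mask)
    return contrastive + weight * reconstruction
```

In this sketch the contrastive term pulls matching views together at the image level, while the masked term only penalizes errors on hidden patches, which mirrors the global and local roles that the abstract assigns to the two branches. A real implementation would compute both terms on batched tensors produced by the shared CNN+Transformer encoder.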