What does this research mean for the field?

The proposed hybrid self-supervised learning framework significantly improves feature extraction accuracy and generalization ability for high-resolution remote sensing images compared to existing frameworks.

What question did this study set out to answer?

The aim is to improve feature extraction and representation in remote sensing images using a hybrid self-supervised learning framework.

February 19, 2026 | Open Access

Contrastive Masked Feature Modeling for Self-Supervised Representation Learning of High-Resolution Remote Sensing Images

Key Points

Aimed to improve feature extraction and representation in remote sensing images with a hybrid self-supervised learning (SSL) framework.
Developed a hybrid SSL framework combining contrastive learning and masked modeling.
Implemented two parallel branches: one for global feature representation and another for local detail analysis.
Used a hybrid CNN+Transformer architecture to better integrate local and global features.
Achieved stronger feature extraction and higher accuracy in small-sample scenarios.
Outperformed state-of-the-art SSL frameworks on large-scale datasets.

Abstract

As an emerging learning paradigm, self-supervised learning (SSL) has attracted extensive attention due to its ability to learn effective feature representations from massive unlabeled data. In particular, SSL driven by contrastive learning and masked modeling shows great potential in general visual tasks. However, because of the diversity of ground target types, the complexity of spectral radiation characteristics, and changes in environmental conditions, existing SSL frameworks exhibit limited feature extraction accuracy and generalization ability when applied to complex remote sensing scenarios. To address this issue, we propose a hybrid SSL framework that integrates the advantages of contrastive learning and masked modeling to extract more robust and reliable features from remote sensing images. The proposed framework includes two parallel branches. One branch uses a contrastive learning strategy, constructing positive and negative sample pairs to strengthen global feature representation and capture image structural information. The other branch adopts a masked modeling strategy, focusing on the fine analysis of local details and predicting the features of masked areas, thereby establishing connections between global and local features. Additionally, to better integrate local and global features, we adopt a hybrid CNN+Transformer architecture, which is particularly suitable for dense prediction tasks such as semantic segmentation. Extensive experimental results demonstrate that the proposed framework not only exhibits superior feature extraction ability and higher accuracy in small-sample scenarios but also outperforms state-of-the-art SSL frameworks on large-scale datasets.
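
The two-branch objective described in the abstract can be illustrated with a minimal sketch. The abstract does not state the actual loss functions, temperature, or branch weighting, so everything here is an assumption for illustration: InfoNCE stands in for the contrastive branch, mean squared error on masked patch features stands in for the masked modeling branch, and the names `info_nce`, `masked_feature_loss`, `hybrid_loss`, `temperature`, and `weight` are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale to unit length; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def info_nce(view_a, view_b, temperature=0.2):
    """Contrastive branch (assumed InfoNCE) over global embeddings.

    view_a[i] and view_b[i] embed two augmentations of image i (the
    positive pair); every other view_b[j] serves as a negative.
    """
    a = [normalize(v) for v in view_a]
    b = [normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [dot(anchor, other) / temperature for other in b]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def masked_feature_loss(predicted, target, mask):
    """Masked branch (assumed MSE), averaged over masked patches only.

    predicted[p] and target[p] are feature vectors for patch p;
    mask[p] is True where the patch was hidden from the encoder.
    """
    errors = [
        sum((x - y) ** 2 for x, y in zip(predicted[p], target[p])) / len(target[p])
        for p in range(len(mask))
        if mask[p]
    ]
    return sum(errors) / len(errors) if errors else 0.0


def hybrid_loss(view_a, view_b, predicted, target, mask, weight=1.0):
    """Weighted sum of both branch losses; the weighting is hypothetical."""
    return info_nce(view_a, view_b) + weight * masked_feature_loss(predicted, target, mask)
```

In this sketch the contrastive term is small when each embedding is closest to its own augmented counterpart, and the masked term only penalizes errors on patches the encoder did not see. A real implementation would compute both terms on tensors produced by the CNN+Transformer backbone rather than on Python lists.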
