Recent progress in geospatial foundation models (GFMs) has demonstrated strong generalization across remote sensing downstream tasks. However, existing GFMs still struggle with fine-grained cropland classification because of ambiguous field boundaries, insufficient and inefficient temporal modeling, and limited cross-regional adaptability. In this paper, we propose CropSTS, a remote sensing foundation model built on a decoupled temporal–spatial attention architecture tailored to the temporal dynamics of cropland remote sensing data. To pre-train the model efficiently with limited labeled data, we employ a hybrid framework that combines a joint-embedding predictive architecture with knowledge distillation from web-scale foundation models. Despite being a compact model trained on a small dataset, CropSTS achieves state-of-the-art mIoU and F1-score on the PASTIS-R benchmark. Our results indicate that structural optimization for temporal encoding and cross-modal knowledge transfer are effective strategies for advancing GFM design in agricultural remote sensing.
Yan et al. studied this question.
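One way to see why a decoupled temporal–spatial attention design can make temporal modeling cheaper is to count query-key pairs. The sketch below compares joint attention over all tokens of an image time series with a factorized scheme that attends over time and over space in separate passes. It illustrates the general idea of factorized attention only, not the CropSTS implementation; the function names and the sizes `T = 12` time steps and `N = 64` patches per image are assumptions chosen for illustration.

```python
def joint_attention_pairs(num_steps: int, num_patches: int) -> int:
    """Query-key pairs when all T*N tokens attend to one another."""
    tokens = num_steps * num_patches
    return tokens * tokens


def decoupled_attention_pairs(num_steps: int, num_patches: int) -> int:
    """Query-key pairs when attention is split into two passes."""
    # Temporal pass: each patch location attends across its T time steps.
    temporal = num_patches * num_steps * num_steps
    # Spatial pass: each time step attends across its N patches.
    spatial = num_steps * num_patches * num_patches
    return temporal + spatial


# Hypothetical sizes, not taken from the paper.
T, N = 12, 64
joint = joint_attention_pairs(T, N)
decoupled = decoupled_attention_pairs(T, N)
print(joint)      # 589824
print(decoupled)  # 58368
```

Under these assumed sizes the factorized scheme scores roughly a tenth as many pairs, and the gap widens as the series gets longer or the images get larger.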