Foundation models are transforming remote sensing image analysis, yet research typically focuses on training new models from scratch rather than combining existing architectures. This study investigates whether vision foundation models can be effectively composed to improve performance across diverse Earth Observation tasks. Using the GEO-Bench framework, we benchmark prominent models, including Prithvi, Hiera, and Dynamic One-For-All (DOFA), across 11 datasets. Our results demonstrate that feature-level ensembling of smaller models can match or exceed the performance of much larger monolithic models across various tasks while remaining more resource-efficient. Centred Kernel Alignment (CKA) analysis reveals a significant representational gap between general-purpose and geospatial models, suggesting that they capture complementary spatial and spectral features and that CKA similarity may serve as a predictive signal for the effectiveness of feature fusion. Furthermore, an ablation study identifies feature concatenation as the most effective fusion strategy, preserving each model's unique information better than averaging, Feature-wise Linear Modulation (FiLM), or dot-product fusion. These findings suggest that strategic ensembling provides a high-performance, low-cost alternative to traditional large-scale pretraining and is a promising approach to advancing geospatial foundation model development.
Chuc et al. (Thu,) studied this question.
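
The representational-similarity analysis and the four fusion strategies named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Several things here are assumptions made for illustration: the CKA variant is the linear one, both models' features are taken to be already pooled and projected to a common dimension, "dot-product" fusion is interpreted as an element-wise product, and the FiLM projection weights (`w_gamma`, `w_beta`) are random placeholders standing in for learned parameters. All function names are hypothetical.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between feature matrices of shape (n_samples, dim).

    Assumption: the linear variant is used; the abstract does not say which kernel.
    """
    # Centre each feature dimension over the samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


def fuse_concat(a, b):
    """Concatenation: keeps every dimension of both feature sets."""
    return np.concatenate([a, b], axis=-1)


def fuse_average(a, b):
    """Averaging: requires equal dimensions and mixes the two sources."""
    return 0.5 * (a + b)


def fuse_product(a, b):
    """Element-wise product (our reading of 'dot-product' fusion; an assumption)."""
    return a * b


def fuse_film(a, b, w_gamma, w_beta):
    """FiLM: features b produce a per-channel scale and shift applied to a."""
    gamma = b @ w_gamma  # scale, shape (n_samples, dim_a)
    beta = b @ w_beta    # shift, shape (n_samples, dim_a)
    return gamma * a + beta


# Synthetic stand-ins for pooled features from two different backbones.
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(64, 16))
feats_b = rng.normal(size=(64, 16))
w_gamma = rng.normal(size=(16, 16))  # placeholder for learned weights
w_beta = rng.normal(size=(16, 16))   # placeholder for learned weights

cka_ab = linear_cka(feats_a, feats_b)
fused = {
    "concat": fuse_concat(feats_a, feats_b),
    "average": fuse_average(feats_a, feats_b),
    "product": fuse_product(feats_a, feats_b),
    "film": fuse_film(feats_a, feats_b, w_gamma, w_beta),
}
```

Only concatenation widens the fused representation (here from 16 to 32 dimensions); the other three operators return a tensor of the original width, which is consistent with the abstract's point that concatenation preserves each model's unique information. A low CKA score between two backbones would, under the abstract's hypothesis, flag the pair as a candidate for fusion.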