Abstract

This study presents a systematic investigation of hybrid MetaFormer-based deep learning architectures for the classification of olive fruit diseases caused by olive fly (Bactrocera oleae) damage and fungal infections, which frequently co-occur in real agricultural settings. Olive production, which holds strategic economic, ecological, and cultural importance in Mediterranean regions, is increasingly threatened by these factors, leading to significant yield and quality losses. Early and accurate disease detection is therefore essential for effective disease management and sustainable olive production. Building upon the MetaFormer framework, this work proposes a modular and task-adaptive architectural paradigm in which heterogeneous token-mixing blocks (identity mapping, random mixing, separable convolution, and self-attention) are systematically combined and sequentially ordered to form hybrid architectures. Rather than introducing a single fixed model, six different MetaFormer-based hybrid configurations are developed to explore how block composition and ordering influence performance, robustness, and computational efficiency. All proposed models are trained from scratch under identical experimental conditions and compared against established baseline architectures, including CAFormer-S18, ConvFormer-S18, and PoolFormerV2-S12. Experimental results demonstrate that several hybrid configurations achieve strong classification performance, with accuracies up to 97.98% and macro F1-scores approaching 0.98, outperforming or matching baseline models while using substantially fewer parameters. In addition to standard evaluation, robustness under realistic domain shifts (such as blur, illumination changes, and colour distortions) is explicitly assessed, revealing that certain block orderings provide improved generalization stability under distribution shifts.
Furthermore, a comprehensive resource-efficiency analysis shows that the proposed models offer a favourable trade-off between accuracy and computational cost, operating with significantly lower parameter counts and competitive inference latency. The ability to achieve high performance and robustness without relying on transfer learning highlights the effectiveness of task-adaptive MetaFormer designs in limited-data scenarios. Overall, this study demonstrates that treating MetaFormer as a configurable block-composition framework enables the development of lightweight, robust, and explainable architectures suitable for real-world agricultural applications. The findings provide valuable insights into architectural design strategies for data-constrained visual recognition tasks and lay the foundation for future research on task-driven and adaptive MetaFormer-based systems.
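To make the block-composition idea concrete, the following is a minimal sketch of how the four token mixers named above could be stacked in a chosen order. It is not the authors' implementation and rests on several simplifying assumptions: tokens form a short 1D sequence rather than a 2D feature map, all mixer weights are fixed rather than learned, the separable convolution is replaced by a 3-tap depthwise average, self-attention uses identity projections, and the channel MLP sub-block is omitted. All function names (`build_hybrid`, `metaformer_block`, and the mixers) are hypothetical.

```python
import math
import random
from typing import Callable, List

Tokens = List[List[float]]            # n_tokens x dim
Mixer = Callable[[Tokens], Tokens]


def layer_norm(x: Tokens, eps: float = 1e-6) -> Tokens:
    """Normalise each token vector to zero mean and unit variance."""
    out = []
    for tok in x:
        mean = sum(tok) / len(tok)
        var = sum((v - mean) ** 2 for v in tok) / len(tok)
        out.append([(v - mean) / math.sqrt(var + eps) for v in tok])
    return out


def softmax(row: List[float]) -> List[float]:
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def mix_tokens(weights: List[List[float]], x: Tokens) -> Tokens:
    """Compute weights @ x: each output token is a weighted sum of input tokens."""
    dim = len(x[0])
    return [[sum(w * tok[d] for w, tok in zip(row, x)) for d in range(dim)]
            for row in weights]


def identity_mixer(x: Tokens) -> Tokens:
    """Identity mapping: no information exchange between tokens."""
    return [tok[:] for tok in x]


def make_random_mixer(n_tokens: int, seed: int = 0) -> Mixer:
    """Random mixing: a frozen, row-normalised random matrix (assumed form)."""
    rng = random.Random(seed)
    weights = [softmax([rng.gauss(0.0, 1.0) for _ in range(n_tokens)])
               for _ in range(n_tokens)]
    return lambda x: mix_tokens(weights, x)


def conv_mixer(x: Tokens) -> Tokens:
    """Stand-in for separable convolution: 3-tap depthwise average, zero padded."""
    dim = len(x[0])
    padded = [[0.0] * dim] + x + [[0.0] * dim]
    return [[(padded[i][d] + padded[i + 1][d] + padded[i + 2][d]) / 3.0
             for d in range(dim)] for i in range(len(x))]


def attention_mixer(x: Tokens) -> Tokens:
    """Single-head self-attention with identity Q/K/V projections (assumption)."""
    scale = math.sqrt(len(x[0]))
    scores = [[sum(a * b for a, b in zip(q, k)) / scale for k in x] for q in x]
    return mix_tokens([softmax(r) for r in scores], x)


def metaformer_block(x: Tokens, mixer: Mixer) -> Tokens:
    """Residual token-mixing sub-block: x + mixer(norm(x)). Channel MLP omitted."""
    mixed = mixer(layer_norm(x))
    return [[a + b for a, b in zip(tok, m)] for tok, m in zip(x, mixed)]


def build_hybrid(order: List[str], n_tokens: int) -> Mixer:
    """Stack one block per mixer name, applied in the given order."""
    mixers = {"identity": identity_mixer, "random": make_random_mixer(n_tokens),
              "conv": conv_mixer, "attention": attention_mixer}
    chosen = [mixers[name] for name in order]

    def forward(x: Tokens) -> Tokens:
        for mixer in chosen:
            x = metaformer_block(x, mixer)
        return x
    return forward


# Two orderings of the same pair of mixers applied to the same toy input.
tokens = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0], [2.0, 0.0, -2.0], [1.5, 1.0, 0.0]]
out_a = build_hybrid(["conv", "attention"], n_tokens=4)(tokens)
out_b = build_hybrid(["attention", "conv"], n_tokens=4)(tokens)
print(out_a != out_b)
```

Because every mixer shares the same interface (tokens in, tokens out), swapping or reordering blocks only changes the list passed to `build_hybrid`. This is the sense in which the framework can be treated as a configurable composition of blocks.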
Erdurak et al. studied this question.