Apple leaf diseases (ALD) pose a significant challenge to global apple production, and accurate ALD identification is crucial for reducing pesticide use and improving fruit quality, particularly in smart agriculture. However, traditional approaches rely on single-model feature extraction and fail to account for relationships between different tasks, which limits their applicability in the apple industry. To address this, we design an optimized convolutional neural network–vision transformer (CNN–ViT) hybrid framework named ResViT-AM that focuses on domain-specific enhancement rather than architectural novelty. Instead of proposing a completely new structure, this work refines existing CNN–Transformer paradigms through task-oriented feature fusion and adaptive attention weighting, tailored for apple leaf disease classification under complex orchard conditions. Using a weighted attention fusion mechanism, the model dynamically integrates features extracted by Residual Network 101 (ResNet-101) and a vision transformer (ViT), blending the local convolutional details of ResNet with the global contextual features of ViT. This approach enhances the model’s representation capability and allows parallel processing of multiple tasks, thereby saving training time and computational resources. We evaluate on the public AppleLeaf dataset, which reflects real-world outdoor conditions. On its held-out test split, the model achieves 99.14% top-1 accuracy, indicating promising performance under complex orchard conditions. Compared with representative convolutional baselines, ResViT-AM shows greater stability and adaptability on challenging cases, offering a competitive and practical solution for automated apple leaf disease diagnosis.
Fang et al. studied this question.
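The abstract does not spell out the exact form of the weighted attention fusion, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that the ResNet-101 and ViT features have already been projected to the same dimension, that each branch receives a single scalar score from a learned gate vector, and that a softmax over the two scores yields the fusion weights. The names `weighted_attention_fusion`, `branch_score`, `cnn_gate`, and `vit_gate` are hypothetical, and plain Python lists stand in for tensors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def branch_score(features, gate_weights, gate_bias=0.0):
    """Scalar relevance score for one branch (assumed form: a linear gate)."""
    return sum(f * w for f, w in zip(features, gate_weights)) + gate_bias


def weighted_attention_fusion(cnn_feat, vit_feat, cnn_gate, vit_gate):
    """Fuse two equal-length feature vectors with softmax-normalized weights.

    cnn_feat: local features (stand-in for the ResNet-101 branch output)
    vit_feat: global features (stand-in for the ViT branch output)
    cnn_gate, vit_gate: hypothetical learned gate vectors, one per branch
    Returns the fused vector and the (cnn, vit) weights used to build it.
    """
    if len(cnn_feat) != len(vit_feat):
        raise ValueError("branch features must share one dimension")
    scores = [
        branch_score(cnn_feat, cnn_gate),
        branch_score(vit_feat, vit_gate),
    ]
    alpha_cnn, alpha_vit = softmax(scores)
    fused = [alpha_cnn * c + alpha_vit * v for c, v in zip(cnn_feat, vit_feat)]
    return fused, (alpha_cnn, alpha_vit)


if __name__ == "__main__":
    # Toy two-dimensional features; real branch outputs are far larger.
    fused, weights = weighted_attention_fusion(
        cnn_feat=[1.0, 2.0],
        vit_feat=[3.0, 4.0],
        cnn_gate=[0.0, 0.0],
        vit_gate=[0.0, 0.0],
    )
    print(fused, weights)
```

Because the weights are computed from the features themselves, they change per input: an image dominated by fine lesion texture can lean on the convolutional branch, while one that needs whole-leaf context can lean on the transformer branch. With both gates set to zero, as in the usage example, the two branches are weighted equally and the fused vector is their element-wise mean.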