Plant diseases pose a significant threat to global agriculture, reducing crop yields and quality. Early and accurate detection is essential for effective crop health management but remains challenging because different diseases look alike and field backgrounds are complex. This study introduces AgriMM, a novel multi-modal detection framework that integrates visual images with expert-validated textual descriptions to improve diagnostic precision. The framework features three key innovations: a Hybrid Convolutional-Attention Collaborative Backbone (HCACB) to capture both fine-grained lesions and global context; a Context-enhanced Visual-Language Path Aggregation Network (CVL-PAN) for multi-scale feature fusion; and an Adaptive Region-Text Contrastive Learning (AR-TCL) module to enforce precise semantic alignment between image regions and text. We constructed a comprehensive dataset comprising 30,000 images and detailed symptom descriptions across five major crops (tomato, cucumber, pepper, eggplant, and squash). Experimental results demonstrate that AgriMM achieves a mean Average Precision (mAP) of 95.2%, significantly outperforming state-of-the-art unimodal baselines by 11.6%. These findings confirm that integrating linguistic semantic priors effectively resolves visual ambiguity, providing a robust tool for precision agriculture and sustainable crop protection.
Wang et al. (Tue,) studied this question.
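To make the idea of region-text contrastive alignment concrete, the following is a minimal sketch of a generic symmetric InfoNCE loss over paired region and text embeddings. It is not the AR-TCL module itself: the abstract does not specify the loss or its adaptive component, so the function names, the temperature of 0.07, and the assumption that row i of the region embeddings matches row i of the text embeddings are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_logits(regions, texts, temperature=0.07):
    """Pairwise cosine similarities between regions and texts, divided by temperature.

    The temperature value is an assumption, not taken from the paper.
    """
    r = [l2_normalize(v) for v in regions]
    t = [l2_normalize(v) for v in texts]
    return [[sum(a * b for a, b in zip(ri, tj)) / temperature for tj in t] for ri in r]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def region_text_contrastive_loss(regions, texts, temperature=0.07):
    """Symmetric InfoNCE: average of region-to-text and text-to-region losses.

    Assumes regions[i] and texts[i] form the matching (positive) pair.
    """
    logits = cosine_logits(regions, texts, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Toy 2-D embeddings: two lesion regions and two symptom descriptions.
regions = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = region_text_contrastive_loss(regions, aligned_texts)
swapped_loss = region_text_contrastive_loss(regions, swapped_texts)
```

In this toy setup the loss is close to zero when each region embedding points in the same direction as its paired description and large when the pairs are swapped, which is the behavior a contrastive alignment objective is meant to reward and penalize, respectively.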