What question did this study set out to answer?

The aim is to enhance plant disease detection accuracy by integrating visual and textual data using a novel framework.

March 26, 2026 | Open Access

Enhancing plant disease detection through multi-modal integration of visual and textual data

Key Points

Developed AgriMM, a multi-modal detection framework combining images and textual symptom descriptions.
Used a Hybrid Convolutional-Attention Collaborative Backbone (HCACB) to capture both fine-grained lesions and global context.
Implemented a Context-enhanced Visual-Language Path Aggregation Network (CVL-PAN) for multi-scale feature fusion.
AgriMM achieved a mean Average Precision (mAP) of 95.2%.
Outperformed state-of-the-art unimodal baselines by 11.6%.
Demonstrated effective resolution of visual ambiguity through linguistic data integration.

Abstract

Plant diseases pose a significant threat to global agriculture, impacting crop yields and quality. Early and accurate detection is essential for effective health management but remains challenging due to visual similarity among diseases and complex field backgrounds. This study introduces AgriMM, a novel multi-modal detection framework that integrates visual images with expert-validated textual descriptions to improve diagnostic precision. The framework features three key innovations: a Hybrid Convolutional-Attention Collaborative Backbone (HCACB) to capture both fine-grained lesions and global context; a Context-enhanced Visual-Language Path Aggregation Network (CVL-PAN) for multi-scale feature fusion; and an Adaptive Region-Text Contrastive Learning (AR-TCL) module to enforce precise semantic alignment. We constructed a comprehensive dataset comprising 30,000 images and detailed symptom descriptions across five major crops (tomato, cucumber, pepper, eggplant, and squash). Experimental results demonstrate that AgriMM achieves a mean Average Precision (mAP) of 95.2%, significantly outperforming state-of-the-art unimodal baselines by 11.6%. These findings confirm that integrating linguistic semantic priors effectively resolves visual ambiguity, providing a robust tool for precision agriculture and sustainable crop protection.

Cite This Study

Wang et al. studied this question.

synapsesocial.com/papers/69c4ccbbfdc3bde4489182b7 https://doi.org/10.1186/s13007-026-01521-w
