Tomato leaf diseases exhibit subtle inter-class differences and substantial intra-class variability, making accurate identification challenging for conventional deep learning models, especially under real-world conditions with diverse lighting, occlusion, and growth stages. Moreover, most existing approaches rely solely on visual features and lack the ability to incorporate semantic descriptions or expert knowledge, limiting their robustness and interpretability. To address these issues, we propose LPDiag, a multimodal prototype-attention diagnostic framework that integrates large language models (LLMs) for fine-grained recognition of tomato diseases. The framework first employs an LLM-driven semantic understanding module to encode symptom-aware textual embeddings from disease descriptions. These embeddings are then aligned with multi-scale visual features extracted by an enhanced Res2Net backbone, enabling cross-modal representation learning. A set of learnable prototype vectors, combined with a knowledge-enhanced attention mechanism, further strengthens the interaction between visual patterns and LLM prior knowledge, resulting in more discriminative and interpretable representations. Additionally, we develop an interactive diagnostic system that supports natural-language querying and image-based identification, facilitating practical deployment in heterogeneous agricultural environments. Extensive experiments on three widely used datasets demonstrate that LPDiag achieves a mean accuracy of 98.83%, outperforming state-of-the-art models while offering improved explanatory capability. The proposed framework offers a promising direction for integrating LLM-based semantic reasoning with visual perception to enhance intelligent and trustworthy plant disease diagnostics.
Dong et al. studied this question.
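The abstract describes learnable prototype vectors interacting with LLM-derived text embeddings through attention, but it does not give the equations. The following is a minimal sketch of one generic way such an interaction could look, not the authors' implementation. It assumes one prototype and one text embedding per disease class, a simple additive fusion of the two, and cosine-similarity attention with a temperature. All names (`prototype_attention`, `temperature`, the toy vectors) are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def prototype_attention(visual, prototypes, text_embeddings, temperature=1.0):
    """Attend from one visual feature vector to text-informed class prototypes.

    Assumption (not from the paper): each key is the normalized sum of a
    class prototype and that class's text embedding.
    Returns (attention weights over classes, attended representation).
    """
    query = l2_normalize(visual)
    keys = [
        l2_normalize([p + t for p, t in zip(proto, text)])
        for proto, text in zip(prototypes, text_embeddings)
    ]
    # Cosine similarity between the image feature and each fused key.
    scores = [dot(query, key) / temperature for key in keys]
    weights = softmax(scores)
    # Weighted sum of keys gives a prototype-informed representation.
    attended = [
        sum(w * key[i] for w, key in zip(weights, keys))
        for i in range(len(visual))
    ]
    return weights, attended


# Toy example: two classes, three-dimensional features (made-up values).
visual = [0.9, 0.1, 0.0]
prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text_embeddings = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]
weights, attended = prototype_attention(
    visual, prototypes, text_embeddings, temperature=0.1
)
```

In this toy setting the attention weights can be read as a per-class relevance score, which is one way a prototype-based design can support the kind of interpretability the abstract refers to. In a trained model the prototypes would be learned parameters and the text embeddings would come from the language model rather than being fixed by hand.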