What question did this study set out to answer?

The aim is to evaluate the diagnostic performance and clinical utility of large language models in identifying oral lesions.

February 27, 2026 | Open Access

Multimodal large language models for oral lesion diagnosis: a systematic review of diagnostic performance and clinical utility

Key Points

Conducted systematic searches in multiple databases including PubMed and Embase.
Included studies that utilized LLMs for diagnosis or differential diagnosis of oral lesions.
Assessed diagnostic accuracy, agreement metrics, and clinical applicability.
Performed narrative synthesis due to variability in study designs.
Evaluated risk of bias using an adapted QUADAS-2 methodology.
Included 17 studies with over 1,200 cases analyzed.
Diagnostic accuracy ranged from 25% to 96%, varying with model version, input modality, and lesion complexity.
Multimodal inputs consistently improved model performance, with Cohen's κ reaching 0.85–0.90.
Advanced LLMs approached expert-level performance in some tasks, but specialists generally retained superior Top-1 accuracy.
Clinical utility was highest in structuring differential reasoning and supporting communication.

Abstract

Background: Diagnosing oral lesions, which range from benign conditions to oral cancer, remains challenging because of overlapping visual features and reliance on histopathology. Large language models (LLMs) can integrate textual and visual cues, but their diagnostic accuracy and clinical utility in real decision-making contexts remain uncertain. This review aimed to systematically evaluate the diagnostic performance, clinical usefulness, and limitations of LLMs in identifying oral lesions.

Methods: PubMed, CINAHL, Embase, Web of Science, and Google Scholar were searched to 20 July 2025. Eligible studies applied LLMs (e.g., ChatGPT, Gemini, DeepSeek, Copilot, Claude) for diagnosis or differential diagnosis of oral lesions using text, images, or multimodal inputs. Outcomes included diagnostic accuracy, agreement metrics, and qualitative assessments of explanation quality and clinical applicability. Risk of bias was assessed using an adapted QUADAS-2. Narrative synthesis was performed because of heterogeneity across studies.

Results: Seventeen studies (1,200 cases) were included. Diagnostic accuracy ranged from 25% to 96%, varying by model version, input modality, and lesion complexity. Multimodal inputs consistently improved performance, with Cohen's κ reaching 0.85–0.90. Advanced models (GPT-4o, DeepSeek-R1, o1-preview) outperformed earlier versions and approached expert performance in some tasks, although specialists generally retained superior Top-1 accuracy. Clinical utility was highest when LLMs were used to structure differential reasoning, highlight red-flag features, and support communication, but was limited in tasks requiring fine morphological interpretation or severity grading. Overall risk of bias was low to moderate.

Conclusions: LLMs demonstrate variable diagnostic performance and context-dependent supportive utility as adjunctive tools in oral lesion assessment, particularly in multimodal settings. They should complement, rather than replace, expert clinical judgment. Future research should prioritize real-world workflow evaluation, standardized prompting strategies, and prospective clinical validation.

Systematic Review Registration: https://www.crd.york.ac.uk/PROSPERO/view/CRD420251090315 (identifier CRD420251090315).
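
For readers unfamiliar with the two headline metrics, the following is a minimal sketch of how Top-1 accuracy and Cohen's κ are conventionally computed. It is not the review's analysis code: the function names, variable names, and the eight toy cases are hypothetical and chosen only for illustration, and none of the values come from the included studies.

```python
from collections import Counter


def top1_accuracy(predicted, reference):
    """Fraction of cases where the first-ranked diagnosis equals the reference."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e) for two raters on the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of cases where both raters give the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data (NOT from the review): reference diagnoses vs. model output.
reference_dx = [
    "leukoplakia", "lichen planus", "squamous cell carcinoma", "fibroma",
    "squamous cell carcinoma", "leukoplakia", "lichen planus", "fibroma",
]
model_dx = [
    "leukoplakia", "lichen planus", "squamous cell carcinoma", "fibroma",
    "leukoplakia", "leukoplakia", "lichen planus", "squamous cell carcinoma",
]

accuracy = top1_accuracy(model_dx, reference_dx)
kappa = cohens_kappa(model_dx, reference_dx)
print(f"Top-1 accuracy: {accuracy:.2f}")
print(f"Cohen's kappa:  {kappa:.2f}")
```

Because κ discounts the agreement expected by chance, it is lower than raw accuracy on the same cases, which is one reason the two metrics are reported separately.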
