Background: Diagnosing oral lesions, which range from benign conditions to oral cancer, remains challenging because of overlapping visual features and reliance on histopathology. Large language models (LLMs) can integrate textual and visual cues, but their diagnostic accuracy and clinical utility in real decision-making contexts remain uncertain. This review systematically evaluates the diagnostic performance, clinical usefulness, and limitations of LLMs in identifying oral lesions.

Methods: PubMed, CINAHL, Embase, Web of Science, and Google Scholar were searched up to 20 July 2025. Eligible studies applied LLMs (e.g., ChatGPT, Gemini, DeepSeek, Copilot, Claude) to the diagnosis or differential diagnosis of oral lesions using text, images, or multimodal inputs. Outcomes included diagnostic accuracy, agreement metrics, and qualitative assessments of explanation quality and clinical applicability. Risk of bias was assessed using an adapted QUADAS-2. A narrative synthesis was performed because of heterogeneity.

Results: Seventeen studies (1,200 cases) were included. Diagnostic accuracy ranged from 25% to 96%, varying by model version, input modality, and lesion complexity. Multimodal inputs consistently improved performance, with Cohen's κ reaching 0.85–0.90. Advanced models (GPT-4o, DeepSeek-R1, o1-preview) outperformed earlier versions and approached expert performance in some tasks, although specialists generally retained superior Top-1 accuracy. Clinical utility was highest when LLMs were used to structure differential reasoning, highlight red-flag features, and support communication, but was limited in tasks requiring fine morphological interpretation or severity grading. Overall risk of bias was low to moderate.

Conclusions: LLMs demonstrate variable diagnostic performance and context-dependent utility as adjunctive tools in oral lesion assessment, particularly in multimodal settings. They should complement, rather than replace, expert clinical judgment. Future research should prioritize real-world workflow evaluation, standardized prompting strategies, and prospective clinical validation.

Systematic Review Registration: https://www.crd.york.ac.uk/PROSPERO/view/CRD420251090315, identifier CRD420251090315.
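
For readers less familiar with the two quantitative outcomes named in the abstract, the following is a minimal sketch of how Top-1 accuracy and Cohen's κ are commonly computed from paired diagnostic labels. It is an illustration only, not the procedure used by the included studies: the function names and the example labels are hypothetical and are not taken from the review.

```python
from collections import Counter


def top1_accuracy(predicted, reference):
    """Fraction of cases where the first-ranked diagnosis matches the reference."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: proportion of cases with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if both raters labelled independently
    # according to their own marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so perfect agreement is returned by convention in this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical labels for six cases (illustrative, not study data).
    reference = ["leukoplakia", "lichen planus", "carcinoma",
                 "fibroma", "carcinoma", "leukoplakia"]
    llm = ["leukoplakia", "lichen planus", "carcinoma",
           "fibroma", "leukoplakia", "leukoplakia"]
    print(f"Top-1 accuracy: {top1_accuracy(llm, reference):.3f}")
    print(f"Cohen's kappa:  {cohens_kappa(llm, reference):.3f}")
```

Because κ discounts agreement expected by chance, it is typically lower than raw accuracy on the same cases, which is one reason the two metrics are reported separately.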
Hassanein et al. studied this question.