What question did this study set out to answer?

This research aims to assess how accurately and consistently multimodal large language models diagnose oral mucosal lesions using images.

April 23, 2026 · Open Access

Exploratory Evaluation of Diagnostic Accuracy and Temporal Reproducibility of Multimodal Large Language Models in the Image-Based Assessment of Oral Mucosal Lesions

Key Points

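The study analyzed 100 anonymized clinical photographs of oral mucosal conditions, divided into four groups of 25: physiological variations, benign mucosal lesions, oral potentially malignant disorders, and oral cancer.
Three multimodal LLMs (ChatGPT-5.1 Plus, Gemini 3 Pro, and Perplexity Pro) each gave a single most probable diagnosis from visual features alone, using an identical prompt.
The full evaluation was repeated in three independent cycles at one-month intervals to test diagnostic consistency over time.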
Gemini achieved the highest diagnostic accuracy, 78% in cycles 2 and 3, significantly outperforming ChatGPT (55–57%) and Perplexity (28–31%) (p < 0.00001).
Intra-model agreement across cycles was moderate for ChatGPT (κ = 0.525) and fair for Gemini (κ = 0.338) and Perplexity (κ = 0.409).
Gemini also had the highest proportion of responses that remained correct across all three cycles (51%).
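
To make the κ values above concrete, here is a minimal sketch of Fleiss' kappa, treating the three testing cycles of one model as three "raters" of the same images. The summary does not say how responses were coded for the agreement analysis, so the correct/incorrect coding and the toy data below are assumptions for illustration only, and `fleiss_kappa` is a hypothetical helper name rather than the authors' code.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list with one label per rater.

    Here a "rater" is one testing cycle of the same model (an assumption
    made for illustration; the paper's exact coding is not stated).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(row) for row in ratings]
    categories = sorted({label for row in ratings for label in row})

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall proportion of each category.
    proportions = [
        sum(cnt[cat] for cnt in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        return float("nan")  # only one category was ever used; kappa is undefined
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy data (hypothetical): 4 images, 3 cycles, 1 = correct, 0 = incorrect.
toy_cycles = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [1, 0, 0]]
toy_kappa = fleiss_kappa(toy_cycles)
print(f"kappa = {toy_kappa:.3f}")
```

Values near 1 mean the model gave the same answer in every cycle, while values near 0 mean agreement no better than chance.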

Abstract

Objective: The aim was to evaluate the diagnostic accuracy and temporal reproducibility of multimodal large language models (LLMs) in the image-based diagnosis of oral mucosal lesions.

Materials and Methods: The study included 100 anonymized clinical photographs of oral mucosal conditions obtained from the archive of the Department of Oral Medicine, School of Dental Medicine, University of Zagreb. Images were categorized into four subgroups of 25 images each: physiological variations, benign mucosal lesions, oral potentially malignant disorders, and oral cancer. Three multimodal LLMs (ChatGPT-5.1 Plus, Gemini 3 Pro, and Perplexity Pro) analyzed each image using an identical prompt and were required to provide a single most probable diagnosis based solely on visual features. To evaluate temporal reproducibility, the entire evaluation was repeated in three independent testing cycles conducted at one-month intervals. Diagnostic accuracy was compared using chi-square tests, while intra-model agreement across cycles was assessed using Fleiss’ kappa.

Results: Gemini demonstrated the highest diagnostic accuracy, reaching 78% correct responses in cycles 2 and 3 and significantly outperforming ChatGPT (55–57%) and Perplexity (28–31%) (p < 0.00001). Subgroup analyses showed similar trends, with Gemini achieving the highest accuracy across most lesion categories. Intra-model agreement across cycles was moderate for ChatGPT (κ = 0.525) and fair for Gemini (κ = 0.338) and Perplexity (κ = 0.409). Gemini also showed the highest proportion of responses that remained correct across all three cycles (51%).

Conclusions: Multimodal LLMs demonstrate promising diagnostic capabilities in the image-based assessment of oral mucosal lesions; however, variability in reproducibility highlights the need for cautious clinical implementation and further validation.
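
The accuracy comparison can be illustrated with a Pearson chi-square test on a models-by-outcome count table. The sketch below is not the authors' analysis: the abstract does not say whether the models were compared jointly or pairwise, and the counts for ChatGPT (57) and Perplexity (31) are illustrative picks from the reported ranges rather than per-cycle figures. Only Gemini's 78 of 100 is taken directly from the abstract. The function names are hypothetical.

```python
import math


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if accuracy were independent of the model.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


def p_value_two_dof(stat):
    """Upper-tail p-value of a chi-square variable with exactly 2 degrees of freedom."""
    return math.exp(-stat / 2.0)


# Rows: Gemini, ChatGPT, Perplexity; columns: correct, incorrect (100 images each).
# 78 is reported in the abstract; 57 and 31 are ILLUSTRATIVE values, not paper data.
table = [[78, 22], [57, 43], [31, 69]]
stat, dof = chi_square_statistic(table)
p_value = p_value_two_dof(stat)  # valid here because a 3 x 2 table has 2 dof
print(f"chi2 = {stat:.2f}, dof = {dof}, p = {p_value:.2e}")
```

The closed-form p-value applies only to 2 degrees of freedom, which is what a three-model, two-outcome table gives; other table shapes would need a general chi-square survival function from a statistics library.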
