Objective: The aim was to evaluate the diagnostic accuracy and temporal reproducibility of multimodal large language models (LLMs) in the image-based diagnosis of oral mucosal lesions. Materials and Methods: The study included 100 anonymized clinical photographs of oral mucosal conditions obtained from the archive of the Department of Oral Medicine, School of Dental Medicine, University of Zagreb. Images were categorized into four subgroups of 25 images each: physiological variations, benign mucosal lesions, oral potentially malignant disorders, and oral cancer. Three multimodal LLMs (ChatGPT-5.1 Plus, Gemini 3 Pro, and Perplexity Pro) analyzed each image using an identical prompt and were required to provide a single most probable diagnosis based solely on visual features. To evaluate temporal reproducibility, the entire evaluation was repeated in three independent testing cycles conducted at one-month intervals. Diagnostic accuracy was compared using chi-square tests, while intra-model agreement across cycles was assessed using Fleiss’ kappa. Results: Gemini demonstrated the highest diagnostic accuracy, reaching 78% correct responses in cycles 2 and 3 and significantly outperforming ChatGPT (55–57%) and Perplexity (28–31%) (p < 0.00001). Subgroup analyses showed similar trends, with Gemini achieving the highest accuracy across most lesion categories. Intra-model agreement across cycles was moderate for ChatGPT (κ = 0.525) and fair for both Gemini (κ = 0.338) and Perplexity (κ = 0.409). Gemini also showed the highest proportion of responses that remained correct across all three cycles (51%). Conclusions: Multimodal LLMs demonstrate promising diagnostic capabilities in the image-based assessment of oral mucosal lesions; however, variability in reproducibility highlights the need for cautious clinical implementation and further validation.
Dumančić et al. studied this question.
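The abstract reports intra-model agreement as Fleiss’ kappa across the three testing cycles. As a reading aid, the statistic can be sketched as below, treating each cycle as one "rater" and each image as one item. This is a minimal illustration and not the authors' analysis code: the function name `fleiss_kappa`, the labels `"correct"` and `"incorrect"`, and the four-image toy data are all hypothetical, and the abstract does not state which response categories were used when the study computed kappa.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated once per rater.

    ratings: list of equal-length lists; ratings[i][r] is the category
    label that rater r (here, assumed to be testing cycle r) gave item i.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2:
        raise ValueError("need at least two ratings per item")
    if any(len(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number of ratings")

    categories = sorted({label for row in ratings for label in row})
    counts = [Counter(row) for row in ratings]

    # Observed agreement: for each item, the share of rater pairs that agree.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    p_expected = sum(
        (sum(c[cat] for c in counts) / total) ** 2 for cat in categories
    )
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category occurs")

    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy data: 4 images, 3 cycles, outcome per cycle.
toy = [
    ["correct", "correct", "correct"],
    ["correct", "correct", "incorrect"],
    ["incorrect", "incorrect", "incorrect"],
    ["correct", "incorrect", "incorrect"],
]
print(round(fleiss_kappa(toy), 3))
```

Kappa corrects the raw share of agreeing cycle pairs for the agreement expected by chance, which is why a model can be the most accurate overall and still show a lower kappa than a less accurate model, as the abstract reports for Gemini and ChatGPT.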