This study evaluated the diagnostic and reasoning capabilities of four state-of-the-art large language models (LLMs) on the Korean Dental Licensing Examination (KDLE) and assessed their potential as educational tools in dentistry.

Four LLMs (ChatGPT-4o, Claude-4 Opus, Gemini 2.5 Pro, and DeepSeek-V3) were evaluated using official KDLE question sets from 2024 and 2025 (n = 642 questions in total). Questions covered 13 dental subjects and included both text-only and image-based items. The primary endpoint was overall accuracy across all items; modality-level and subject-wise analyses were conducted as secondary and exploratory assessments. Performance was analyzed using Cochran's Q test for overall comparisons, McNemar's test for pairwise contrasts, and Cohen's kappa for inter-model agreement. Statistical significance was set at p < 0.05.

All models performed better on text-only than on image-based questions. LLMs consistently outperformed students in Oral Biology but underperformed in Oral and Maxillofacial Radiology. Cohen's kappa indicated substantial inter-model agreement (κ = 0.631–0.778).

Contemporary LLMs demonstrate competent performance on standardized dental licensing examinations, with three models achieving near-human competency. However, persistent limitations in visual interpretation and clinical reasoning suggest that their role should remain supplementary to human expertise in dental education and practice. While LLMs show promise as educational tools for exam preparation and knowledge reinforcement, their limitations in visual interpretation and integrative clinical reasoning necessitate continued human oversight in clinical decision-making contexts.
Kim et al. studied this question.
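The three statistics named in the abstract (Cochran's Q, McNemar's test, and Cohen's kappa) can be illustrated with a minimal Python sketch. It assumes each model's answers are scored as 1 (correct) or 0 (incorrect) per question. The function names and the toy data are hypothetical and are not taken from the study, and the exact binomial form of McNemar's test is an assumption, since the abstract does not say which variant the authors used.

```python
from math import comb


def cochran_q(table):
    """Cochran's Q statistic for a questions x models table of 0/1 scores.

    Under the null hypothesis of equal accuracy, Q is compared against a
    chi-square distribution with (number of models - 1) degrees of freedom.
    """
    k = len(table[0])
    col_totals = [sum(row[j] for row in table) for j in range(k)]
    row_totals = [sum(row) for row in table]
    n = sum(row_totals)
    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - n * n)
    # The denominator is zero when every question is answered correctly by
    # all models or by none; Q is undefined in that case.
    denominator = k * n - sum(r * r for r in row_totals)
    return numerator / denominator


def mcnemar_exact_p(a, b):
    """Two-sided exact (binomial) McNemar p-value for two 0/1 score lists."""
    only_a = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    only_b = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n = only_a + only_b  # only discordant pairs carry information
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def cohen_kappa(a, b):
    """Cohen's kappa for agreement between two 0/1 score lists."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # chance agreement
    return (observed - expected) / (1 - expected)


# Toy data (made up for illustration): 8 questions x 3 hypothetical models.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
]
model_1 = [row[0] for row in scores]
model_2 = [row[1] for row in scores]
model_3 = [row[2] for row in scores]

if __name__ == "__main__":
    print("Cochran's Q:", cochran_q(scores))
    print("McNemar p (model 1 vs 3):", mcnemar_exact_p(model_1, model_3))
    print("Cohen's kappa (model 1 vs 2):", cohen_kappa(model_1, model_2))
```

On this toy table, Q is 4.8, the exact McNemar p-value for models 1 and 3 is 0.125, and kappa for models 1 and 2 is about 0.385. A real analysis would more likely rely on an established statistics package than on hand-written functions.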