What question did this study set out to answer?

This research aims to assess the diagnostic and reasoning abilities of advanced large language models on the KDLE and their role as educational tools in dentistry.

March 1, 2026 · Open Access

Comparative Performance of State-of-the-Art LLMs on the KDLE: A 2025 Benchmark Study

Key Points

Evaluated 4 state-of-the-art LLMs using official KDLE question sets from 2024 and 2025 (n = 642)
Conducted overall accuracy assessments and modality-level analyses
Applied Cochran's Q test, McNemar's test, and Cohen's kappa for performance analysis
Analyzed both text-only and image-based questions across 13 dental subjects
All models surpassed the passing threshold of 180 points
ChatGPT-4o, Claude-4 Opus, and Gemini 2.5 Pro performed at a level close to human examinees
DeepSeek-V3 passed but underperformed compared to peers
Performance differed significantly among models (Cochran's Q = 116.40, p < .001)
Models excelled on text-only questions and performed variably in Oral and Maxillofacial Radiology
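
The statistical comparison named in the points above can be illustrated with a minimal Python sketch. It assumes each model's answers have been scored as 1 (correct) or 0 (incorrect) per question. The toy matrix, the column order, and the function names `cochran_q` and `mcnemar_exact` are hypothetical; they are not the study's data or code. The study does not say whether the exact or the chi-square form of McNemar's test was used; the exact binomial form appears here only because it needs no external library.

```python
from math import comb


def cochran_q(results):
    """Cochran's Q for a question-by-model matrix of 0/1 scores.

    results: list of rows, one per question, each with one 0/1 entry per model.
    Returns (Q statistic, degrees of freedom).
    """
    k = len(results[0])  # number of models
    col_totals = [sum(row[j] for row in results) for j in range(k)]
    row_totals = [sum(row) for row in results]
    total = sum(row_totals)
    denom = k * total - sum(r * r for r in row_totals)
    if denom == 0:
        # Happens only when every question is answered the same way by all models.
        raise ValueError("Q is undefined: no question discriminates between models")
    q = (k - 1) * (k * sum(c * c for c in col_totals) - total ** 2) / denom
    return q, k - 1


def mcnemar_exact(a, b):
    """Two-sided exact McNemar p-value for two paired lists of 0/1 scores."""
    only_a = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    only_b = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n = only_a + only_b  # discordant pairs
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical toy data: 6 questions (rows) scored for 4 models (columns).
toy = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
]

q_stat, df = cochran_q(toy)
first_model = [row[0] for row in toy]
last_model = [row[3] for row in toy]
p_pair = mcnemar_exact(first_model, last_model)
print(q_stat, df, p_pair)
```

With k models, the Q statistic is referred to a chi-square distribution with k − 1 degrees of freedom (3 for four models); `scipy.stats.chi2.sf(q_stat, df)` is one way to obtain the p-value. Pairwise McNemar tests would normally follow only when the overall test is significant.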

Abstract

This study evaluated the diagnostic and reasoning capabilities of 4 state-of-the-art large language models (LLMs) on the Korean Dental Licensing Examination (KDLE) and assessed their potential as educational tools in dentistry.

Four LLMs (ChatGPT-4o, Claude-4 Opus, Gemini 2.5 Pro, and DeepSeek-V3) were evaluated using official KDLE question sets from 2024 and 2025 (n = 642 questions in total). Questions covered 13 dental subjects and included both text-only and image-based items. The primary endpoint was overall accuracy across all items; modality-level and subject-wise analyses were conducted as secondary and exploratory assessments. Performance was analyzed using Cochran's Q test for overall comparisons, McNemar's test for pairwise contrasts, and Cohen's kappa for inter-model agreement. Statistical significance was set at p < 0.05.

All four models surpassed the passing threshold of 180 points, but performance differed significantly among them (Q = 116.40, p < .001): ChatGPT-4o, Claude-4 Opus, and Gemini 2.5 Pro performed close to human examinees, whereas DeepSeek-V3 passed but trailed its peers. All models performed better on text-only than on image-based questions. LLMs consistently outperformed students in Oral Biology but underperformed in Oral and Maxillofacial Radiology. Cohen's kappa indicated substantial inter-model agreement (κ = 0.631–0.778).

Contemporary LLMs demonstrate competent performance on standardized dental licensing examinations, with 3 models achieving near-human competency. However, persistent limitations in visual interpretation and clinical reasoning suggest that their role should remain supplementary to human expertise in dental education and practice. LLMs show promise as educational tools for exam preparation and knowledge reinforcement, but these limitations, particularly in integrative clinical reasoning, necessitate continued human oversight in clinical decision-making contexts.
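
As an illustration of the inter-model agreement measure reported above, the following Python sketch computes Cohen's kappa for two models from per-question correctness. The score lists and the function name `cohen_kappa` are hypothetical, and treating each answer as a binary correct/incorrect rating is an assumption; the study may have compared the selected answer options instead.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two paired lists of 0/1 scores (1 = correct)."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("inputs must be non-empty and of equal length")
    # Observed agreement: share of questions where both models match.
    p_o = sum(1 for x, y in zip(a, b) if x == y) / n
    # Chance agreement from each model's marginal accuracy.
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_e = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_e == 1:
        raise ValueError("kappa is undefined when both score lists are constant")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical scores for two models on 10 questions.
model_a = [1, 1, 1, 1, 0, 0, 0, 1, 1, 0]
model_b = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0]

kappa = cohen_kappa(model_a, model_b)
print(round(kappa, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, so two models that are both highly accurate can still show only moderate kappa. On the widely used Landis and Koch scale, values from 0.61 to 0.80 are labeled "substantial", which matches the wording used for the reported range.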
