What question did this study set out to answer?

The study assesses how various LLMs perform on endodontics exam questions in order to determine their accuracy and practicality in clinical scenarios.

June 3, 2026 · Open Access

A comparative analysis of the performance of leading large language models on the endodontics section of the dentistry specialization exam in Türkiye

Key Points

The study assesses how various LLMs perform on endodontics exam questions in order to determine their accuracy and practicality in clinical scenarios.
Evaluated eight large language models using 127 multiple-choice questions from the Specialization Exam in Dentistry (DUS).
Compared LLM responses to official answer keys to assess accuracy.
Conducted statistical analyses with Pearson's chi-square and McNemar tests.
Gemini 2.5 Pro had the highest overall accuracy at 90.6%, while ChatGPT-4o scored 61.4%.
In Clinical Practice Questions, Gemini 2.5 Pro scored 93.9%, significantly higher than ChatGPT-4o's 57.6% (p = 0.019).
For General Knowledge and Concept Questions, Gemini 2.5 Pro (89.4%) outperformed ChatGPT-4o (62.8%; p < 0.001).
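
The reported percentages and the paired comparison can be illustrated with a short sketch. It is not the authors' analysis code. It recovers the number of correct answers implied by an overall accuracy on 127 questions, and it implements one common form of the McNemar test, the exact binomial version. Using the exact variant is an assumption, since the source does not say which form was applied. The function names and the discordant-pair inputs `b` and `c` are hypothetical, and the source does not report per-question results.

```python
from math import comb

TOTAL_QUESTIONS = 127  # number of DUS endodontics questions, from the source


def implied_correct(accuracy_pct: float, n: int) -> int:
    """Return the count of correct answers that rounds to accuracy_pct out of n."""
    k = round(accuracy_pct / 100 * n)
    # Guard against a percentage that no whole number of answers can produce.
    if round(100 * k / n, 1) != accuracy_pct:
        raise ValueError(f"{accuracy_pct}% is not reachable with n={n}")
    return k


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: questions only model A answered correctly (hypothetical input)
    c: questions only model B answered correctly (hypothetical input)
    Questions both models got right or both got wrong do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) lower tail up to the smaller discordant count, doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


gemini_correct = implied_correct(90.6, TOTAL_QUESTIONS)
chatgpt4o_correct = implied_correct(61.4, TOTAL_QUESTIONS)
print(gemini_correct, chatgpt4o_correct)
```

McNemar's test suits this design because every model answers the same questions, so the observations are paired, and only the questions on which two models disagree carry information about their difference.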

Abstract

Objective: This study aimed to evaluate and compare the performance of eight contemporary LLMs on the endodontics section of the DUS, assessing their accuracy in both theoretical knowledge and simulated clinical scenarios from historical exam data.

Methods: The performance of eight different large language models (Claude 4, DeepSeek V3, Gemini 2.5 Pro, ChatGPT-4o, ChatGPT-5, Grok 4, LLaMA 4, and Perplexity) was evaluated using 127 multiple-choice endodontics questions from the Specialization Exam in Dentistry (DUS) administered by the Student Selection and Placement Center (ÖSYM) between 2012 and 2021. The models’ responses were compared against the official answer keys. Statistical analyses were performed using Pearson’s chi-square and McNemar tests, with a significance level of α = 0.05.

Results: Significant differences existed among LLMs in overall accuracy (p 0.05).

Conclusion: Contemporary LLMs demonstrate substantial competence in endodontic knowledge, with Gemini 2.5 Pro excelling in both theoretical and clinical queries. However, significant performance variability across models (61.4%–90.6%) and the complexity of retrieving and resolving clinical exam queries necessitate domain-specific optimization and expert oversight for reliable integration into dental education and practice.
