Objective: This study evaluated and compared the performance of eight contemporary large language models (LLMs) on the endodontics section of the Specialization Exam in Dentistry (DUS), assessing their accuracy on both theoretical knowledge and simulated clinical scenarios drawn from historical exam data.

Methods: Eight LLMs (Claude 4, DeepSeek V3, Gemini 2.5 Pro, ChatGPT-4o, ChatGPT-5, Grok 4, LLaMA 4, and Perplexity) were evaluated on 127 multiple-choice endodontics questions from the DUS exams administered by the Student Selection and Placement Center (ÖSYM) between 2012 and 2021. The models' responses were compared against the official answer keys. Statistical analyses were performed using Pearson's chi-square and McNemar tests, with a significance level of α = 0.05.

Results: Significant differences existed among LLMs in overall accuracy (p 0.05).

Conclusion: Contemporary LLMs demonstrate substantial competence in endodontic knowledge, with Gemini 2.5 Pro excelling in both theoretical and clinical queries. However, significant variability in accuracy across models (61.4% to 90.6%) and the complexity of retrieving and resolving clinical exam queries call for domain-specific optimization and expert oversight before these models can be reliably integrated into dental education and practice.
Başkan et al. studied this question.
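The two tests named in the Methods can be illustrated with a minimal Python sketch. It is not the authors' analysis code. The function names are hypothetical, and the sketch uses the exact (binomial) form of the McNemar test, which may differ from the variant used in the study. The correct-answer counts 115 and 78 out of 127 are back-calculated from the reported accuracy range (90.6% and 61.4%). The discordant-pair counts (40 and 3) are invented purely for illustration, since the abstract does not report them.

```python
from math import comb


def pearson_chi_square(table):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Rows: two models; columns: (correct, incorrect) out of 127 questions.
# 115/127 and 78/127 correspond to the reported 90.6% and 61.4%.
accuracy_table = [[115, 12], [78, 49]]
chi2_stat, chi2_dof = pearson_chi_square(accuracy_table)

# Hypothetical discordant pairs: questions only one of the two models got right.
p_mcnemar = mcnemar_exact(40, 3)

CRITICAL_DF1_ALPHA_05 = 3.841  # chi-square critical value for df = 1, alpha = 0.05
print(chi2_stat > CRITICAL_DF1_ALPHA_05, p_mcnemar < 0.05)
```

Because every model answers the same 127 questions, the responses are paired. The McNemar test uses only the questions on which two models disagree, whereas the chi-square test compares the overall correct and incorrect counts.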