Large language model (LLM)-based chatbots are increasingly used in healthcare, yet their diagnostic accuracy, consistency, and temporal stability in endodontics remain insufficiently evaluated. This study aimed to assess and compare the performance of LLM-based chatbots using established international clinical guidelines. A diagnostic accuracy study with a repeated-measures design was conducted. Two LLMs (ChatGPT-5 and Gemini-2.5 Flash) were evaluated using 200 structured yes/no items derived from consensus-based international position statements covering the full scope of endodontic practice. Each model was tested weekly over three consecutive sessions, yielding 600 responses per model. Accuracy was assessed against reference answers, consistency was analyzed using Fleiss’ κ and Cohen’s κ, and logistic regression was performed to evaluate the effects of model type, week, and clinical domain. ChatGPT demonstrated significantly higher overall accuracy than Gemini (92.8% vs. 84.8%; OR = 2.37; p = 0.004) and near-perfect reproducibility (κ = 0.95), whereas Gemini showed fair agreement (κ = 0.38). Both models performed well in structured clinical domains but showed reduced accuracy in areas related to restoration and pulpal disease management. LLM-based chatbots show potential as decision support and educational tools in endodontics. However, performance limitations in complex clinical domains highlight the continued need for expert oversight in clinical decision-making.
Sismanoglu et al. studied this question.
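
The accuracy and consistency measures named in the abstract can be illustrated with a minimal sketch. It assumes each item is answered "yes" or "no" in each of the three weekly sessions, and it treats the sessions as the "raters" in Fleiss' κ. The function names (`fleiss_kappa`, `accuracy`) and the four-item toy data are hypothetical and are not taken from the study; the authors' actual analysis pipeline is not described here.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated the same number of times.

    ratings: list of lists, e.g. one inner list per item holding the answer
    given in each session (hypothetical layout, not the study's data format).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    totals = Counter()   # how often each category was used overall
    p_items = []         # observed agreement per item
    for row in ratings:
        counts = Counter(row)
        totals.update(counts)
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        p_items.append(agreeing_pairs / (n_raters * (n_raters - 1)))

    p_bar = sum(p_items) / n_items                       # mean observed agreement
    p_e = sum((t / (n_items * n_raters)) ** 2            # agreement expected by chance
              for t in totals.values())
    if p_e == 1.0:
        # Only one category was ever used; kappa is undefined.
        # Returning 1.0 here is an assumption of this sketch.
        return 1.0
    return (p_bar - p_e) / (1 - p_e)


def accuracy(ratings, reference):
    """Share of all responses (items x sessions) that match the reference answer."""
    correct = sum(ans == ref for row, ref in zip(ratings, reference) for ans in row)
    total = sum(len(row) for row in ratings)
    return correct / total


# Toy data: 4 items x 3 sessions (invented for illustration only).
demo_ratings = [
    ["yes", "yes", "yes"],
    ["no", "no", "no"],
    ["yes", "no", "yes"],   # the model changed its answer in one session
    ["no", "no", "no"],
]
demo_reference = ["yes", "no", "yes", "no"]

demo_kappa = fleiss_kappa(demo_ratings)
demo_accuracy = accuracy(demo_ratings, demo_reference)
```

With 200 items and three sessions per model, the same layout would hold 600 responses per model, matching the count reported in the abstract. Cohen's κ for a pair of sessions follows the same observed-versus-chance structure, restricted to two raters.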