What question did this study set out to answer?

This study assesses the diagnostic accuracy and consistency of LLM-based chatbots in endodontics by comparing their performance against established guidelines.

April 18, 2026 | Open Access

Performance of large language models in endodontics: accuracy, consistency, and benchmarking with consensus guidelines

Key points

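The authors conducted a diagnostic accuracy study with a repeated-measures design.
They evaluated two LLMs (ChatGPT-5 and Gemini-2.5 Flash) using 200 structured yes/no items derived from consensus-based international position statements.
Each model was tested weekly over three consecutive sessions, yielding 600 responses per model (200 items × 3 sessions).
Consistency was evaluated with Fleiss' κ and Cohen's κ, and logistic regression was used to assess the effects of model type, week, and clinical domain.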
ChatGPT demonstrated significantly higher accuracy (92.8%) compared to Gemini (84.8%).
ChatGPT showed near-perfect reproducibility (κ = 0.95), while Gemini had fair agreement (κ = 0.38).
Both models performed well in structured clinical domains but struggled in restoration and pulpal disease management.
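As a rough sanity check on the reported accuracies, an unadjusted odds ratio can be computed directly from the two proportions. The sketch below is illustrative only: the function names are hypothetical, and the result (about 2.31) is a crude ratio. The study's own odds ratio comes from a logistic regression that also includes week and clinical domain, so the two figures are not expected to match exactly.

```python
def odds(p: float) -> float:
    """Convert a proportion (0 < p < 1) into odds."""
    if not 0.0 < p < 1.0:
        raise ValueError("proportion must lie strictly between 0 and 1")
    return p / (1.0 - p)


def crude_odds_ratio(p_a: float, p_b: float) -> float:
    """Unadjusted odds ratio of a correct answer, model A versus model B."""
    return odds(p_a) / odds(p_b)


# Overall accuracies as reported in the summary above.
chatgpt_accuracy = 0.928
gemini_accuracy = 0.848

print(round(crude_odds_ratio(chatgpt_accuracy, gemini_accuracy), 2))
```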

Abstract

Large language model (LLM)-based chatbots are increasingly used in healthcare, yet their diagnostic accuracy, consistency, and temporal stability in endodontics remain insufficiently evaluated. This study aimed to assess and compare the performance of LLM-based chatbots using established international clinical guidelines. A diagnostic accuracy study with a repeated-measures design was conducted. Two LLMs (ChatGPT-5 and Gemini-2.5 Flash) were evaluated using 200 structured yes/no items derived from consensus-based international position statements covering the full scope of endodontic practice. Each model was tested weekly over three consecutive sessions, yielding 600 responses per model. Accuracy was assessed against reference answers, consistency was analyzed using Fleiss’ κ and Cohen’s κ, and logistic regression was performed to evaluate the effects of model type, week, and clinical domain. ChatGPT demonstrated significantly higher overall accuracy than Gemini (92.8% vs. 84.8%; OR = 2.37; p = 0.004) and near-perfect reproducibility (κ = 0.95), whereas Gemini showed fair agreement (κ = 0.38). Both models performed well in structured clinical domains but showed reduced accuracy in areas related to restoration and pulpal disease management. LLM-based chatbots show potential as decision support and educational tools in endodontics. However, performance limitations in complex clinical domains highlight the continued need for expert oversight in clinical decision-making.
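To make the consistency metric concrete, here is a minimal sketch of Fleiss' κ for repeated yes/no answers. It assumes that each of the three weekly sessions is treated as one "rater" of every item, which the abstract does not state explicitly. The function name and the toy answers are hypothetical and are not the study's data.

```python
from collections import Counter


def fleiss_kappa(ratings: list[list[str]]) -> float:
    """Fleiss' kappa for items rated the same number of times.

    ratings[i] holds the labels given to item i, one per session
    (sessions play the role of raters; this is an assumption).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number of ratings")

    counts = [Counter(row) for row in ratings]
    categories = sorted({label for row in ratings for label in row})

    # Observed agreement: mean share of agreeing rating pairs per item.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    shares = [
        sum(c[cat] for c in counts) / (n_items * n_raters)
        for cat in categories
    ]
    p_expected = sum(s ** 2 for s in shares)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category occurs")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy answers for four items across three sessions.
toy_answers = [
    ["yes", "yes", "yes"],
    ["no", "no", "no"],
    ["yes", "yes", "no"],
    ["no", "no", "no"],
]
print(fleiss_kappa(toy_answers))
```

A value of 1 means the model gave identical answers in every session. Values near 0 mean the agreement is no better than chance given how often each answer occurs overall.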


Cite This Study

Sismanoglu et al. studied this question.

synapsesocial.com/papers/69e31f7340886becb653ea95 https://doi.org/10.1186/s12903-026-08269-8
