What question did this study set out to answer?

The study aims to evaluate how accurately different AI systems respond to patient inquiries about endodontic pain and antibiotic use.

April 19, 2026 | Open Access

Evaluation of the Accuracy of Responses Provided by AI-Based Conversational Systems to Patient Questions Regarding Endodontic Pain and Antibiotic Use

Key Points

Twenty clinical scenarios on endodontic pain and antibiotic use were prepared: 10 in which antibiotics were indicated and 10 in which they were not.
Responses from four AI systems were evaluated: ChatGPT, DeepSeek, Gemini, and Copilot.
An endodontic specialist rated each response on a 3-point accuracy scale (1 = incorrect, 2 = partially correct, 3 = correct).
Statistical analysis used the Wilcoxon signed-rank and Kruskal-Wallis tests.
All AI systems performed similarly for scenarios requiring antibiotic use and those that did not.
No statistically significant difference in accuracy was found among the AI systems (p > 0.05).
A total of 80 responses were evaluated, with 56 classified as correct and 24 as partially correct.
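
The arithmetic behind these counts can be checked with a short sketch. The counts and the 3-point scale come from the study summary; the variable names are illustrative, and the pooled mean score is a derived figure that the source does not report. Because the scale is ordinal, that mean is only a rough summary.

```python
# Tally of the reported ratings, pooled across all four AI systems.
SCENARIOS = 20
SYSTEMS = 4

# Counts as reported in the study summary.
counts = {"correct": 56, "partially correct": 24, "incorrect": 0}

# 3-point scale used by the endodontic specialist.
scale = {"incorrect": 1, "partially correct": 2, "correct": 3}

total = SCENARIOS * SYSTEMS
assert sum(counts.values()) == total  # 20 scenarios x 4 systems = 80 responses

# Share of each rating category, in percent.
shares = {label: 100 * n / total for label, n in counts.items()}

# Pooled mean score (derived here, not reported in the source).
mean_score = sum(scale[label] * n for label, n in counts.items()) / total

for label, pct in shares.items():
    print(f"{label}: {pct:.1f}%")
print(f"pooled mean score: {mean_score:.2f}")
```

Under these counts, 70% of responses were rated correct and 30% partially correct.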

Abstract

Aim: This study compared the accuracy of responses provided by different AI-based conversational systems to patient questions regarding endodontic pain and antibiotic use.

Methods: A total of 20 clinical scenarios related to endodontic pain and antibiotic use were prepared. Ten scenarios represented clinical conditions in which antibiotic use was indicated, and the other 10 represented conditions in which it was not. All scenarios were submitted to four AI-based systems: ChatGPT (OpenAI, San Francisco, USA), DeepSeek (DeepSeek AI, Hangzhou, China), Gemini (Google, Mountain View, USA), and Copilot (Microsoft, Redmond, USA). A new session was started in the relevant system for each scenario, and the responses were recorded. An endodontic specialist rated the responses for antibiotic use indications on a 3-point scale (1 = incorrect, 2 = partially correct, 3 = correct). The Wilcoxon signed-rank test and the Kruskal-Wallis test were used for data analysis, with the significance level set at p < 0.05.

Results: All AI systems performed similarly in scenarios where antibiotic use was indicated and in those where it was not. The difference between indicated and non-indicated scenarios was not statistically significant for ChatGPT, DeepSeek, Gemini, or Copilot (p = 0.317, p = 0.564, p = 0.317, and p = 0.102, respectively). Nor was any significant difference found among the AI systems in overall performance (H = 3.292; p = 0.349). Because each of the 20 clinical scenarios was submitted to four AI-based conversational systems, a total of 80 responses were evaluated. Of these, 56 were classified as correct and 24 as partially correct; none was classified as incorrect.

Conclusion: The evaluated AI-based conversational systems generally provided correct or partially correct responses to patient questions related to endodontic pain and antibiotic use. No statistically significant difference was found among the systems, and all systems demonstrated similar performance. These findings suggest that AI-based systems may have supportive potential in providing patient information. Nevertheless, because some responses were incomplete or ambiguous, these systems should not replace expert evaluation.
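
As a consistency check on the reported Kruskal-Wallis result, the sketch below converts H = 3.292 to a p-value. It assumes the usual chi-square approximation with k - 1 = 3 degrees of freedom for the four systems; the source does not state how its p-value was computed, and the function name `chi2_sf_df3` is illustrative.

```python
import math


def chi2_sf_df3(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom.

    Closed form for df = 3: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
    """
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


H = 3.292        # Kruskal-Wallis statistic reported in the abstract
GROUPS = 4       # ChatGPT, DeepSeek, Gemini, Copilot
ALPHA = 0.05     # significance level stated in the abstract

assert GROUPS - 1 == 3  # the closed form above is specific to 3 degrees of freedom

p_value = chi2_sf_df3(H)
print(f"p = {p_value:.3f}")
print("significant" if p_value < ALPHA else "not significant")
```

Under this assumption the result rounds to the reported p = 0.349, well above the 0.05 threshold.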
