What question did this study set out to answer?

This study aims to evaluate the accuracy, reliability, quality, and readability of responses from three LLM chatbots regarding orthognathic treatment.

May 15, 2026 | Open Access

Exploring LLM-based chatbot effectiveness in answering questions related to the risks and benefits of orthognathic treatment: a cross-sectional study

Key Points

Twenty frequently searched patient questions were entered into three chatbots: ChatGPT4o, Microsoft Copilot, and Google Gemini 2.5 Flash.
Responses were scored with modified DISCERN, the global quality scale (GQS), and Flesch Reading Ease; intra- and inter-rater reliability was assessed with Cohen's kappa and intra-class correlation coefficients.
Non-parametric statistical tests were applied because the data were not normally distributed.
Copilot scored highest in reliability and quality, with significant differences in modified DISCERN (P < 0.001) and GQS (P = 0.046).
Readability was better for Gemini and ChatGPT than for Copilot, indicating varied accessibility of responses.
Accuracy scores did not show significant differences (P = 0.704), suggesting similar accuracy among chatbots.
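
To make the rater-agreement measure named above concrete, here is a minimal Python sketch of Cohen's kappa for two raters. The function name and the example scores are hypothetical and are not taken from the study, which does not report which software it used.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items (unweighted)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical quality scores for ten responses; NOT data from the study.
rater_1 = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]
rater_2 = [5, 4, 3, 3, 5, 2, 4, 4, 3, 4]
kappa = cohens_kappa(rater_1, rater_2)  # about 0.72
```

Kappa corrects raw agreement (here 8 of 10 items) for the agreement expected by chance, which is why it is lower than the raw 0.8.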

Abstract

OBJECTIVE: To assess the accuracy, reliability, quality, and readability of responses generated by three large language model chatbots (ChatGPT4o, Microsoft Copilot, and Google Gemini 2.5 Flash) when answering common patient questions about the risks and benefits of orthognathic treatment.

MATERIALS AND METHODS: Twenty frequently searched questions were identified via Google and entered into each chatbot. Responses were evaluated using validated scoring systems for accuracy, modified DISCERN, global quality scale (GQS), and Flesch Reading Ease. Intra- and inter-rater reliability was assessed using Cohen's kappa and intra-class correlation coefficients. Non-parametric tests were applied due to non-normal data distribution.

RESULTS: Copilot achieved the highest reliability and quality scores, with significant differences observed in modified DISCERN (P < 0.001) and GQS (P = 0.046). Post hoc tests confirmed that Copilot significantly outperformed ChatGPT. Accuracy scores did not differ significantly (P = 0.704). Readability varied significantly, with Gemini and ChatGPT producing more accessible responses than Copilot. Intra- and inter-rater reliability scores were substantial to excellent for categorical measures and excellent for readability.

CONCLUSIONS: Copilot provided the most reliable and high-quality responses, whilst ChatGPT and Gemini offered greater readability. Despite these strengths, variability in accuracy and reliability highlights the need for caution. Chatbots should be considered as supplementary tools, and patients should verify information with qualified professionals.
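
As an illustration of the readability measure used above, the following Python sketch computes the standard Flesch Reading Ease score, 206.835 - 1.015 × (words per sentence) - 84.6 × (syllables per word). The syllable counter is a rough vowel-group heuristic assumed here for illustration; the study does not say which tool it used, and dedicated readability software counts syllables more carefully.

```python
import re


def count_syllables(word):
    """Rough heuristic (an assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text):
    """Flesch Reading Ease; higher scores indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical example sentences, not chatbot responses from the study.
plain = flesch_reading_ease("The cat sat on the mat.")
technical = flesch_reading_ease(
    "Orthognathic surgery repositions the maxilla and mandible "
    "to correct skeletal discrepancies."
)
```

Short sentences built from short words score high, while long clinical terms pull the score down sharply, which is the kind of gap the readability comparison between chatbots captures.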
