OBJECTIVE: To assess the accuracy, reliability, quality, and readability of responses generated by three large language model chatbots (ChatGPT-4o, Microsoft Copilot, and Google Gemini 2.5 Flash) when answering common patient questions about the risks and benefits of orthognathic treatment. MATERIALS AND METHODS: Twenty frequently searched questions were identified via Google and entered into each chatbot. Responses were evaluated using a validated accuracy scoring system, the modified DISCERN instrument (reliability), the Global Quality Scale (GQS; quality), and the Flesch Reading Ease score (readability). Intra- and inter-rater reliability was assessed using Cohen's kappa and intra-class correlation coefficients. Non-parametric tests were applied because the data were not normally distributed. RESULTS: Copilot achieved the highest reliability and quality scores, with significant differences observed in modified DISCERN (P < 0.001) and GQS (P = 0.046). Post hoc tests confirmed that Copilot significantly outperformed ChatGPT. Accuracy scores did not differ significantly (P = 0.704). Readability varied significantly, with Gemini and ChatGPT producing more accessible responses than Copilot. Intra- and inter-rater reliability scores were substantial to excellent for categorical measures and excellent for readability. CONCLUSIONS: Copilot provided the most reliable and highest-quality responses, whilst ChatGPT and Gemini offered greater readability. Despite these strengths, variability in accuracy and reliability highlights the need for caution. Chatbots should be considered supplementary tools, and patients should verify information with qualified professionals.
Smyth et al. studied this question.
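For readers unfamiliar with two of the measures named in the methods, the following is a minimal Python sketch of the standard Flesch Reading Ease formula and of unweighted Cohen's kappa. It is illustrative only: the abstract does not say which software the authors used, whether their kappa was weighted, or how words and syllables were counted. The function names and the example inputs are hypothetical.

```python
from collections import Counter


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease from raw counts; higher means easier text.

    Counting words, sentences, and syllables is left to the caller
    (an assumption of this sketch; real tools use their own heuristics).
    """
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def cohen_kappa(rater_a, rater_b) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given identical scores.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal proportions per category.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when both raters use a single category
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical counts for one chatbot response, not data from the study.
    print(flesch_reading_ease(words=100, sentences=5, syllables=150))
    # Hypothetical scores from two raters on six responses.
    print(cohen_kappa([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1]))
```

Flesch Reading Ease scores of roughly 60 to 70 are conventionally read as plain English, with lower scores indicating harder text. For ordinal scales such as GQS, a weighted kappa is often preferred over the unweighted form sketched here.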