What question did this study set out to answer?

May 21, 2026

Assessment of accuracy, reliability, readability, and semantic consistency of generative AI responses to rephrased orthodontic patient questions

Key Points

The aim is to evaluate the accuracy, quality, reliability, readability, and semantic consistency of AI responses to orthodontic questions.
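The authors identified 13 frequently asked orthodontic questions and rephrased each once, yielding 26 prompts.
Each prompt was submitted to five AI models (ChatGPT-3.5, ChatGPT-4, Bing Copilot, Google Gemini 1.0 Pro, LLaMA-3-70B-Instruct).
Five orthodontists rated the responses using a 4-point accuracy Likert scale, the Global Quality Scale, a modified DISCERN checklist, and the Flesch Reading Ease Score; statistical analyses included ANOVA, chi-square tests, and ICC.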
ChatGPT-4 achieved moderate consistency (54%).
Gemini showed high stability (92%) but lower overall quality.
Bing Copilot and ChatGPT-3.5 had low consistency (46% and 23%).
All models produced responses with low readability (Flesch scores 45–49).
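
The consistency percentages above line up with counts out of 13 question pairs (7/13 ≈ 54%, 12/13 ≈ 92%, 6/13 ≈ 46%, 3/13 ≈ 23%). The sketch below shows one way such a rate could be computed, assuming a pair counts as consistent when the original and rephrased prompts receive the same rating. That definition, the function name consistency_rate, and the sample ratings are illustrative assumptions, not details taken from the study.

```python
from typing import Sequence


def consistency_rate(original: Sequence[int], rephrased: Sequence[int]) -> float:
    """Fraction of question pairs rated identically for both phrasings.

    Assumed definition: a pair is "consistent" when both ratings are equal.
    """
    if len(original) != len(rephrased) or len(original) == 0:
        raise ValueError("need two non-empty rating lists of equal length")
    matches = sum(1 for a, b in zip(original, rephrased) if a == b)
    return matches / len(original)


# Hypothetical 4-point accuracy ratings for 13 questions (not study data).
original_ratings = [4, 4, 3, 4, 2, 4, 3, 4, 4, 3, 4, 4, 2]
rephrased_ratings = [4, 3, 3, 4, 2, 4, 4, 4, 3, 3, 3, 2, 3]

rate = consistency_rate(original_ratings, rephrased_ratings)
print(f"{rate:.0%}")  # 7 of 13 pairs match, which prints 54%
```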

Abstract

Background: Patients increasingly consult generative artificial intelligence (GenAI) chatbots for orthodontic information. However, because a single question may be phrased in multiple ways, the consistency of responses remains uncertain.

Objective: To evaluate and compare the accuracy, quality, reliability, readability, and semantic stability of five GenAI chatbot models in answering common orthodontic patient questions and their rephrased variants.

Materials and Methods: Thirteen frequently asked orthodontic questions were identified and each was rephrased once, yielding 26 prompts. Each prompt was submitted to five AI models under standardized conditions. All models (ChatGPT-3.5, ChatGPT-4, Bing Copilot, Google Gemini 1.0 Pro, and LLaMA-3-70B-Instruct) were accessed via publicly available web-based chat interfaces. Five orthodontists independently rated responses using a 4-point accuracy Likert scale, Global Quality Scale, modified DISCERN checklist, and Flesch Reading Ease Score (FRES). Readability was additionally assessed by students in grades 5–8. Semantic stability was determined by comparing ratings for original versus rephrased prompts. Statistical analyses included ANOVA, chi-square tests, and intraclass correlation coefficients (ICC).

Results: ChatGPT-4 ranked second but exhibited moderate consistency (54%). Gemini showed good stability (92%) but lower quality. Bing Copilot and ChatGPT-3.5 underperformed, especially in consistency (46% and 23%). All models produced responses with low readability (Flesch scores 45–49).

Conclusions: GenAI chatbots provide generally accurate orthodontic information, but performance depends on question phrasing. Only LLaMA and Gemini show high consistency. Better consistency, readability, and evidence-based accuracy are needed before LLMs are used in patient education.
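
For context on the readability figures, the Flesch Reading Ease Score is conventionally computed as 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word). The sketch below applies that standard formula. The syllable counter is a rough vowel-run heuristic and the helper names are assumptions for illustration; the abstract does not say which tool the authors used.

```python
import re


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula computed from raw counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def count_syllables(word: str) -> int:
    """Rough heuristic: count runs of vowels (dedicated tools use finer rules)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fres_from_text(text: str) -> float:
    """Approximate FRES for a passage using the heuristic counter above."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return flesch_reading_ease(len(words), len(sentences), syllables)


# Hypothetical sample sentence, not a chatbot response from the study.
print(round(fres_from_text("Braces move teeth slowly. Treatment takes time."), 1))
```

Higher scores mean easier text. On the conventional Flesch interpretation scale, scores between 30 and 50 are labeled difficult, which is the band that the reported 45–49 falls into.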

Cite This Study

Batra et al. studied this question.

synapsesocial.com/papers/6a0ea15cbe05d6e3efb5ff18 https://doi.org/10.1093/ejo/cjag024