Abstract

Background
Patients increasingly consult generative artificial intelligence (GenAI) chatbots for orthodontic information. However, because a single question may be phrased in multiple ways, the consistency of responses remains uncertain.

Objective
To evaluate and compare the accuracy, quality, reliability, readability, and semantic stability of five GenAI chatbot models in answering common orthodontic patient questions and their rephrased variants.

Materials and Methods
Thirteen frequently asked orthodontic questions were identified and each was rephrased once, yielding 26 prompts. Each prompt was submitted to five AI models under standardized conditions. The models (ChatGPT-3.5, ChatGPT-4, Bing Copilot, Google Gemini 1.0 Pro, and LLaMA-3-70B-Instruct) were all accessed via publicly available web-based chat interfaces. Five orthodontists independently rated responses using a 4-point accuracy Likert scale, Global Quality Scale, modified DISCERN checklist, and Flesch Reading Ease Score (FRES). Readability was additionally assessed by students in grades 5–8. Semantic stability was determined by comparing ratings for original versus rephrased prompts. Statistical analyses included ANOVA, chi-square tests, and intraclass correlation coefficients (ICC).

Results
ChatGPT-4 ranked second but exhibited moderate consistency (54%). Gemini showed good stability (92%) but lower quality. Bing Copilot and ChatGPT-3.5 underperformed, especially in consistency (46% and 23%, respectively). All models produced responses with low readability (Flesch scores 45–49).

Conclusions
GenAI chatbots provide generally accurate orthodontic information, but performance depends on question phrasing. Only LLaMA and Gemini show high consistency. Better consistency, readability, and evidence-based accuracy are needed before LLMs are used in patient education.
Batra et al. (Sat,) studied this question.
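
The two computed measures named in the abstract can be illustrated with a minimal Python sketch. The FRES function uses the standard published Flesch formula and takes raw counts as input, so no syllable-counting heuristic is implied. The stability function rests on an assumption that the abstract does not confirm: that a question counts as stable when its rating is identical for the original and the rephrased prompt. The function names and the example ratings are hypothetical and are not the authors' code or data.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease Score from raw counts (standard formula)."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def stability_rate(original: list[int], rephrased: list[int]) -> float:
    """Fraction of questions rated identically before and after rephrasing.

    ASSUMPTION: exact agreement of the two ratings defines a stable question;
    the abstract does not state the criterion actually used.
    """
    if len(original) != len(rephrased) or not original:
        raise ValueError("rating lists must be non-empty and of equal length")
    matches = sum(o == r for o, r in zip(original, rephrased))
    return matches / len(original)


if __name__ == "__main__":
    # Hypothetical 4-point accuracy ratings for 13 question pairs.
    orig = [4, 4, 3, 4, 2, 3, 4, 4, 3, 2, 4, 3, 4]
    reph = [4, 3, 3, 4, 3, 3, 4, 2, 3, 4, 4, 2, 3]
    print(f"stability: {stability_rate(orig, reph):.0%}")
    print(f"FRES: {flesch_reading_ease(100, 5, 150):.1f}")
```

Under this per-question assumption, each reported consistency figure would correspond to a whole number of the 13 question pairs; for example, 7 stable pairs out of 13 rounds to 54%.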