Patients increasingly rely on Large Language Models (LLMs) for health information, yet the accuracy and readability of AI-generated dental advice remain variable across different clinical domains and Artificial Intelligence (AI) models. This study therefore aimed to compare the readability, accuracy, and comprehensiveness of responses generated by four leading AI models (ChatGPT-4o Mini, ChatGPT-5, Google Gemini 2.5 Flash, and DeepSeek V3) to patient questions on functional appliances. Thirty-eight frequently asked questions were identified using a structured Google search and categorized into three domains: “treatment fundamentals and general information,” “lifestyle and practical concerns,” and “appointments and long-term results”. Each question was independently answered by the four AI models. Readability was assessed using the Flesch–Kincaid tools. Accuracy and comprehensiveness were independently rated by two blinded orthodontists. AI-generated responses were generally accurate and comprehensive but difficult to read, requiring college-level literacy. ChatGPT-5 produced the lowest readability scores (most difficult-to-read responses; P < .001). Although Gemini 2.5 Flash achieved the highest comprehensiveness scores across all three domains, these differences were not statistically significant. Treatment-related questions yielded lower readability scores than lifestyle-related queries across all models (P < .001). No single model demonstrated superior performance across all evaluated domains. AI-generated information on functional appliances was generally accurate and comprehensive but often exceeded recommended patient literacy thresholds. Readability must be considered alongside informational quality when deploying AI tools for patient education.
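The abstract does not say which Flesch–Kincaid tool or software was used, so the following is only a minimal sketch of the two standard formulas (Flesch Reading Ease and Flesch–Kincaid Grade Level). The function names `count_syllables` and `readability` are hypothetical, and the syllable counter is a rough vowel-group heuristic, not the method used in the study.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption, not the study's method):
    count vowel groups and discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> dict:
    """Return Flesch Reading Ease and Flesch-Kincaid Grade Level for text."""
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters, optionally with one internal apostrophe.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    return {
        # Higher score = easier to read.
        "flesch_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
        # Approximate US school grade needed to understand the text.
        "flesch_kincaid_grade": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word
        - 15.59,
    }


if __name__ == "__main__":
    easy = readability("The cat sat.")
    hard = readability("Functional appliances modify mandibular growth.")
    print(easy)
    print(hard)
```

On the Reading Ease scale a lower score means harder text, which matches the abstract's reading of "lowest readability scores" as the most difficult responses. Longer sentences and more syllables per word both push the score down and the grade level up.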
Badran et al. studied this question.