May 24, 2026 · Open Access

Accuracy, readability, and content coverage of AI-generated responses to questions on functional appliances

Key Points

This research aimed to assess the accuracy, readability, and comprehensiveness of AI-generated dental advice concerning functional appliances.
Responses to 38 frequently asked questions were generated by four AI models: ChatGPT-4o Mini, ChatGPT-5, Google Gemini 2.5 Flash, and DeepSeek V3.
Readability was assessed with Flesch–Kincaid tools; accuracy and comprehensiveness were rated independently by two blinded orthodontists.
Questions were categorized into three domains: treatment fundamentals and general information, lifestyle and practical concerns, and appointments and long-term results.
AI responses were generally accurate and comprehensive but required college-level literacy; ChatGPT-5 had the lowest readability scores (P < .001).
Gemini 2.5 Flash achieved the highest comprehensiveness scores across all three domains, though the differences were not statistically significant.
Treatment-related questions had significantly lower readability scores than lifestyle-related queries (P < .001).

Abstract

Patients increasingly rely on large language models (LLMs) for health information, yet the accuracy and readability of AI-generated dental advice remain variable across different clinical domains and artificial intelligence (AI) models. This study therefore aimed to compare the readability, accuracy, and comprehensiveness of responses generated by four leading AI models (ChatGPT-4o Mini, ChatGPT-5, Google Gemini 2.5 Flash, and DeepSeek V3) to patient questions on functional appliances. Thirty-eight frequently asked questions were identified using a structured Google search and categorized into three domains: “treatment fundamentals and general information,” “lifestyle and practical concerns,” and “appointments and long-term results.” Each question was independently answered by the four AI models. Readability was assessed using the Flesch–Kincaid tools. Accuracy and comprehensiveness were independently rated by two blinded orthodontists. AI-generated responses were generally accurate and comprehensive but difficult to read, requiring college-level literacy. ChatGPT-5 produced the lowest readability scores (most difficult-to-read responses; P < .001). Although Gemini 2.5 Flash achieved the highest comprehensiveness scores across all three domains, these differences were not statistically significant. Treatment-related questions yielded lower readability scores than lifestyle-related queries across all models (P < .001). No single model demonstrated superior performance across all evaluated domains. AI-generated information on functional appliances was generally accurate and comprehensive but often exceeded recommended patient literacy thresholds. Readability must be considered alongside informational quality when deploying AI tools for patient education.
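
For readers unfamiliar with the Flesch–Kincaid measures mentioned above, the sketch below computes the two standard scores, Flesch Reading Ease and Flesch–Kincaid Grade Level, from their published formulas. The summary does not say which tool or settings the authors used, so this is only an illustration: the function names are hypothetical, and the syllable counter is a crude vowel-group heuristic that will differ from dedicated readability software.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (assumption: vowel groups, minus a silent final 'e')."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_scores(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)

    # Standard formulas: higher ease means easier text; grade maps to US school grade.
    ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return ease, grade


if __name__ == "__main__":
    # Hypothetical sample sentences, not taken from the study's data.
    simple = "Wear your brace every day. Take it out to eat."
    dense = "Functional appliances facilitate mandibular repositioning during adolescence."
    for label, text in (("simple", simple), ("dense", dense)):
        ease, grade = flesch_scores(text)
        print(f"{label}: ease={ease:.1f}, grade={grade:.1f}")
```

A lower Reading Ease score (or a higher grade level) indicates harder text, which is the sense in which the study reports that ChatGPT-5 had the "lowest readability scores."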
