What question did this study set out to answer?

The study aims to evaluate the reliability of AI-based large language models in providing guidance on early maxillary expansion for children.

March 16, 2026 | Open Access

Evaluating the reliability of artificial intelligence in clinical decisions on early maxillary expansion: a comparative study of eight large language models

What does this research mean for the field?

Some large language models (LLMs) can provide reliable and understandable information about early maxillary expansion for children, but none consistently excel across all evaluated criteria of accuracy, comprehensiveness, and readability. Novelty: confirmatory. Consensus alignment: neutral.

Key Points

The study aims to evaluate the reliability of AI-based large language models in providing guidance on early maxillary expansion for children.
Eight large language models answered 20 questions reflecting common parental concerns about early maxillary expansion, 10 for each dentition phase (primary and mixed).
Responses were analyzed for accuracy, comprehensiveness, and readability.
Readability was assessed using the Flesch Reading Ease Score (FRES) and the Flesch-Kincaid Grade Level (FKGL).
Statistical analyses included parametric and non-parametric tests, chosen according to data distribution, with significance set at p < 0.05.
DeepSeek V3 and Grok scored highest for accuracy and comprehensiveness.
DeepSeek V3 demonstrated the best readability in the mixed dentition phase.
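Copilot, GPT-5, and GPT-4o produced the most readable outputs (highest Flesch Reading Ease Score, lowest Flesch-Kincaid Grade Level), though their content accuracy was comparatively lower.
MediSearch, Gemini 2.5 Flash, and Claude 4.5 Sonnet showed consistently weaker performance across all evaluated criteria.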

Abstract

Large language models (LLMs) are increasingly used to provide health-related information, yet their clinical reliability remains uncertain. This study aimed to evaluate the accuracy, comprehensiveness, and readability of AI-based LLMs in providing evidence-based guidance on early maxillary expansion for children in the primary and mixed dentition phases, assessing their potential as trustworthy resources for parental decision-making. Eight LLMs (DeepSeek V3, Gemini 2.5 Flash, Claude 4.5 Sonnet, MediSearch, Copilot, GPT-5, GPT-4o, and Grok) were tasked with responding to a total of 20 questions reflecting common parental concerns about early maxillary expansion, with 10 questions assigned to each dentition phase (primary and mixed). Responses were evaluated for accuracy and comprehensiveness, and readability was assessed using the Flesch Reading Ease Score (FRES) and Flesch-Kincaid Grade Level (FKGL). Statistical analyses included descriptive statistics and appropriate parametric and non-parametric tests based on data distribution, with significance set at p < 0.05. Significant differences were observed among the LLMs in terms of accuracy, comprehensiveness, and readability (p < 0.001). DeepSeek V3 and Grok achieved the highest scores for both accuracy and comprehensiveness across both dentition phases, with DeepSeek V3 also demonstrating the highest readability in the mixed dentition phase. Copilot, GPT-5, and GPT-4o produced the most readable outputs, as indicated by their highest FRES and lowest FKGL scores, though their content accuracy was comparatively lower. In contrast, MediSearch, Gemini 2.5 Flash, and Claude 4.5 Sonnet showed consistently weaker performance across all evaluated criteria. This study concluded that, although some LLMs can offer reliable and understandable information about early maxillary expansion, none consistently excel across all evaluated criteria. 
In healthcare contexts, integrating scientific accuracy with readability is crucial for supporting informed parental decision-making, enhancing overall health literacy, and strengthening patient–clinician communication. These findings highlight that AI-based LLMs should serve as supplementary tools that support, rather than replace, professional orthodontic guidance.
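
The two readability metrics named in the abstract are both computed from average sentence length and average syllables per word. The following is a minimal Python sketch of the standard published formulas. The function names, the regex-based sentence and word splitting, and the vowel-group syllable counter are illustrative assumptions, not the tool used in the study, so scores will differ somewhat from those of dedicated readability software.

```python
import re
from typing import Tuple


def count_syllables(word: str) -> int:
    """Rough syllable estimate: number of consecutive-vowel groups.

    This heuristic is an assumption made for illustration; it miscounts
    words with silent vowels (for example, a trailing silent "e").
    """
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def readability_scores(text: str) -> Tuple[float, float]:
    """Return (FRES, FKGL) for an English text.

    Sentence and word splitting here is deliberately simple and is
    also an assumption, not the study's procedure.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    # Flesch Reading Ease Score: higher values mean easier text.
    fres = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    # Flesch-Kincaid Grade Level: approximate US school grade required.
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fres, fkgl


if __name__ == "__main__":
    plain = "The cat sat on the mat."
    clinical = "Maxillary expansion widens the upper jaw."
    for sample in (plain, clinical):
        fres, fkgl = readability_scores(sample)
        print(f"{sample!r}: FRES={fres:.1f}, FKGL={fkgl:.1f}")
```

A higher FRES and a lower FKGL both indicate text that is easier to read, which is why the models with the highest FRES and lowest FKGL are described as the most readable. Neither metric says anything about whether the content is accurate, which is consistent with the study reporting readability and accuracy separately.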
