Abstract

Aim: To compare the clinical quality, readability, and originality of responses from nine generative large language models (LLMs) to pediatric dental queries commonly asked by caregivers.

Materials and Methods: A cross-sectional evaluation of nine LLMs (ChatGPT-3.5, ChatGPT-4o, Claude 3.5 Haiku, Claude 3.7 Sonnet, Gemini 2.0, Gemini 2.5, Grok-3, Grok-3 Mini, and DeepSeek-V3) was conducted using 20 standardized open-ended pediatric dental questions. Responses were rated by 10 pediatric dentists using the Modified Global Quality Scale (MGQS). Readability was assessed with the Flesch Reading Ease (FRE) score and the Flesch–Kincaid Grade Level (FKGL), and originality was analyzed using Turnitin®. Differences among models were tested with one-way analysis of variance (ANOVA) and post hoc tests, and inter-rater agreement was assessed with Cohen’s kappa.

Results: ChatGPT-4o achieved the highest MGQS score (4.40 ± 0.30, P < 0.001), while DeepSeek-V3 scored the lowest (2.02 ± 0.25). Claude 3.7 Sonnet produced the most readable responses (FRE 76.29 ± 10.77), whereas Grok-3 Mini produced the most complex text (FKGL 14.10 ± 3.90). All LLMs demonstrated high originality (<17% similarity), with Claude 3.5 Haiku and Grok-3 Mini showing the lowest overlap (2%). Inter-rater agreement was substantial (κ = 0.72).

Conclusion: ChatGPT-4o demonstrated superior content quality, while Claude 3.7 Sonnet and Gemini 2.5 produced the most readable responses. Performance variability among LLMs warrants cautious integration into pediatric dental guidance.
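For readers unfamiliar with the two readability indices named in the methods, the sketch below shows the standard published formulas for Flesch Reading Ease and Flesch–Kincaid Grade Level. It is an illustration only: the abstract does not say which tool the authors used to compute the scores, and the function names and the example counts are hypothetical. Word, sentence, and syllable counts are taken as inputs because syllable counting varies between tools.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula; higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level formula; approximates a US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


if __name__ == "__main__":
    # Hypothetical counts for one response (not data from the study).
    words, sentences, syllables = 100, 5, 150
    print(round(flesch_reading_ease(words, sentences, syllables), 2))
    print(round(flesch_kincaid_grade(words, sentences, syllables), 2))
```

The two indices move in opposite directions: a higher FRE indicates easier text, while a higher FKGL indicates text that requires more years of schooling, which is why the most readable model is reported by FRE and the most complex by FKGL.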
Raj et al. (Thu,) studied this question.