Parents increasingly rely on large language models (LLMs) to obtain pediatric health information; however, the accuracy, clinical appropriateness, and readability of AI-generated responses remain variable. This concern is particularly relevant for rickets, a preventable metabolic bone disease in which delayed recognition or inappropriate guidance may result in adverse outcomes. This study aimed to compare the content quality, clinical appropriateness, and readability of responses generated by contemporary LLMs to parent-oriented questions about rickets using structured, multidisciplinary expert evaluation.

Twenty-two frequently asked parent-oriented questions regarding rickets were identified from authoritative patient education resources and categorized into four thematic domains. Each question was posed to three LLMs (GPT-5.1, DeepSeek V3.2, and Gemini 3 Pro) using a standardized parent-focused prompt. Responses were collected as single-turn outputs between November 16 and 20, 2025. All responses were anonymized and independently evaluated by four clinicians (two pediatricians and two orthopedic surgeons). Content quality was assessed using a modified Artificial Intelligence Evaluation Score for Common Patient Questions (AIES-CPQ; range 5–25) and the Global Quality Scale (GQS; range 1–5). Readability was analyzed using five established indices. Inter-model differences were assessed using the Friedman test with Bonferroni-adjusted Wilcoxon signed-rank post-hoc comparisons, and inter-rater reliability was evaluated using intraclass correlation coefficients.

Significant differences were observed among models for both AIES-CPQ and GQS scores (p < 0.001). Gemini 3 Pro and DeepSeek V3.2 achieved higher overall content quality and educational scores than GPT-5.1, although their relative strengths varied across evaluation domains. DeepSeek V3.2 demonstrated higher inter-rater reliability, while Gemini 3 Pro generated more detailed but linguistically complex responses. Readability analysis revealed substantial variability across models, indicating a trade-off between informational depth and accessibility.

LLM-generated responses to parent-oriented questions about rickets vary substantially in quality, clinical appropriateness, and readability. While newer-generation models provide higher-quality information, none demonstrates uniformly reliable performance across all domains. Structured, disease-specific evaluation frameworks combined with multidisciplinary expert oversight are essential before AI-generated content can be safely integrated into parent-facing pediatric education.
Çörekçi et al. studied this question.
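The inter-model comparison described in the abstract (a Friedman test across three models, followed by Bonferroni-adjusted pairwise comparisons) can be illustrated with a minimal sketch. This is not the authors' analysis code: the function names and the score triples are hypothetical, the p-value uses the asymptotic chi-square approximation (which has a closed form only because three models give two degrees of freedom), and the pairwise Wilcoxon p-values are assumed to come from a statistics package rather than being computed here.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values ascending (1 = lowest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_three_models(rows):
    """Tie-corrected Friedman statistic and asymptotic p-value.

    rows: one (model_a, model_b, model_c) score triple per question.
    """
    n, k = len(rows), 3
    if n == 0 or any(len(row) != k for row in rows):
        raise ValueError("expected a non-empty list of score triples")
    rank_sums = [0.0] * k
    tie_term = 0
    for row in rows:
        for col, rank in enumerate(average_ranks(list(row))):
            rank_sums[col] += rank
        tie_term += sum(t * (t * t - 1) for t in Counter(row).values())
    correction = 1 - tie_term / (n * k * (k * k - 1))
    if correction == 0:
        raise ValueError("every row is fully tied; the statistic is undefined")
    q = (12 * sum(r * r for r in rank_sums) / (n * k * (k + 1))
         - 3 * n * (k + 1)) / correction
    # Chi-square survival function with df = k - 1 = 2 is exp(-q / 2).
    return q, math.exp(-q / 2)


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical scores for six questions; not data from the study.
toy_scores = [(18, 22, 23), (17, 21, 24), (19, 23, 22),
              (16, 20, 21), (18, 22, 24), (20, 21, 23)]
q_stat, p_value = friedman_three_models(toy_scores)
print(f"Friedman Q = {q_stat:.3f}, p = {p_value:.4f}")

# Hypothetical raw p-values for the three pairwise post-hoc comparisons.
print(bonferroni([0.01, 0.02, 0.5]))
```

With three models there are three pairwise comparisons, so the Bonferroni adjustment amounts to multiplying each raw post-hoc p-value by three, which is equivalent to testing each pair at a significance threshold of 0.05 / 3.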