What question did this study set out to answer?

April 24, 2026 · Open Access

Performance of three large language models in answering parent-focused questions on rickets: a dual pediatric–orthopedic specialist evaluation

Key Points

This study compared the quality, clinical appropriateness, and readability of LLM-generated responses to parent-focused questions about rickets.
The authors conducted structured evaluations of three LLMs (GPT-5.1, DeepSeek V3.2, Gemini 3 Pro) using 22 parent-focused rickets questions.
Four clinicians (two pediatricians and two orthopedic surgeons) independently scored the responses with a modified AIES-CPQ and the Global Quality Scale; readability was measured with five established indices.
Inter-model differences were analyzed with the Friedman test and Bonferroni-adjusted Wilcoxon signed-rank comparisons, and inter-rater reliability with intraclass correlation coefficients.
Content quality scores differed significantly among the LLMs (p < 0.001), with Gemini 3 Pro and DeepSeek V3.2 outperforming GPT-5.1.
DeepSeek V3.2 showed higher inter-rater reliability than the other models, while Gemini 3 Pro gave more detailed but linguistically more complex responses.
Readability varied across models, highlighting a trade-off between depth of information and ease of understanding.
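To make the depth-versus-readability trade-off concrete, here is a minimal sketch of one readability index. This summary does not name the five indices used in the study, so Flesch Reading Ease is an assumption chosen purely for illustration. The syllable counter is a rough heuristic, and the function names and sample sentences are hypothetical rather than taken from the paper.

```python
import re


def count_syllables(word):
    """Rough heuristic: count vowel groups and drop a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical answers: a short plain one and a dense clinical one.
plain = "Rickets makes bones soft. Sun and milk can help."
dense = (
    "Nutritional rickets results from insufficient mineralization of "
    "developing skeletal tissue, predominantly attributable to prolonged "
    "vitamin D deficiency."
)

print(round(flesch_reading_ease(plain), 1))
print(round(flesch_reading_ease(dense), 1))
```

The dense sentence carries more clinical detail but scores far lower, which is the kind of pattern the key points describe for the more detailed responses.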

Abstract

Parents increasingly rely on large language models (LLMs) to obtain pediatric health information; however, the accuracy, clinical appropriateness, and readability of AI-generated responses remain variable. This concern is particularly relevant for rickets, a preventable metabolic bone disease in which delayed recognition or inappropriate guidance may result in adverse outcomes. This study aimed to compare the content quality, clinical appropriateness, and readability of responses generated by contemporary LLMs to parent-oriented questions about rickets using structured, multidisciplinary expert evaluation.

Twenty-two frequently asked parent-oriented questions regarding rickets were identified from authoritative patient education resources and categorized into four thematic domains. Each question was posed to three LLMs (GPT-5.1, DeepSeek V3.2, and Gemini 3 Pro) using a standardized parent-focused prompt. Responses were collected as single-turn outputs between November 16 and 20, 2025. All responses were anonymized and independently evaluated by four clinicians (two pediatricians and two orthopedic surgeons). Content quality was assessed using a modified Artificial Intelligence Evaluation Score for Common Patient Questions (AIES-CPQ; range 5–25) and the Global Quality Scale (GQS; range 1–5). Readability was analyzed using five established indices. Inter-model differences were assessed using the Friedman test with Bonferroni-adjusted Wilcoxon signed-rank post-hoc comparisons, and inter-rater reliability was evaluated using intraclass correlation coefficients.

Significant differences were observed among models for both AIES-CPQ and GQS scores (p < 0.001). Gemini 3 Pro and DeepSeek V3.2 achieved higher overall content quality and educational scores compared with GPT-5.1, although their relative strengths varied across evaluation domains. DeepSeek V3.2 demonstrated higher inter-rater reliability, while Gemini 3 Pro generated more detailed but linguistically complex responses. Readability analysis revealed substantial variability across models, indicating a trade-off between informational depth and accessibility.

LLM-generated responses to parent-oriented questions about rickets vary substantially in quality, clinical appropriateness, and readability. While newer-generation models provide higher-quality information, none demonstrate uniformly reliable performance across all domains. Structured, disease-specific evaluation frameworks combined with multidisciplinary expert oversight are essential before AI-generated content can be safely integrated into parent-facing pediatric education.
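The inter-model comparison named in the abstract (a Friedman test followed by Bonferroni-adjusted pairwise comparisons) can be sketched as follows. This is a minimal standard-library illustration, not the authors' analysis code. The scores are hypothetical, the function names are invented for this example, and the p-value relies on the chi-square approximation, which for three groups (2 degrees of freedom) reduces to exp(-Q/2). The pairwise Wilcoxon signed-rank tests themselves are omitted; only the adjusted significance threshold is shown.

```python
import math


def average_ranks(values):
    """Rank values ascending (1 = lowest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_three_groups(rows):
    """Friedman test for rows of three paired scores (one row per question).

    Returns (statistic, p_value) with a tie correction applied.
    """
    n, k = len(rows), 3
    rank_sums = [0.0] * k
    tie_term = 0.0
    for row in rows:
        if len(row) != k:
            raise ValueError("each row must hold exactly three scores")
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
        for value in set(row):
            t = row.count(value)
            tie_term += t**3 - t
    stat = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    correction = 1.0 - tie_term / (n * k * (k * k - 1))
    if correction == 0:
        raise ValueError("every row is fully tied; the test is undefined")
    stat /= correction
    return stat, math.exp(-stat / 2)  # chi-square survival function, 2 df


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold after Bonferroni adjustment."""
    return alpha / n_comparisons


# Hypothetical AIES-CPQ totals (range 5-25) for six questions and three models.
scores = [
    (15, 19, 22),
    (14, 20, 21),
    (16, 18, 23),
    (13, 17, 20),
    (15, 21, 24),
    (12, 18, 19),
]

stat, p_value = friedman_three_groups(scores)
print(f"Friedman statistic = {stat:.2f}, p = {p_value:.4f}")
# Three models give three pairwise comparisons.
print(f"Bonferroni-adjusted alpha = {bonferroni_alpha(0.05, 3):.4f}")
```

With 22 questions, as in the study, the chi-square approximation is more reasonable than with the six illustrative rows used here.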

Çörekçi et al. studied this question.

synapsesocial.com/papers/69eb0803553a5433e34b34ac https://doi.org/10.1186/s12887-026-06851-1
