Background: Craniosynostosis surgery poses complex challenges for caregivers, who often turn to large language models (LLMs) for preoperative and postoperative information. Although LLMs offer accessible guidance, concerns persist about the quality and readability of the information they provide, especially in specialized surgical contexts. Methods: This study evaluated the readability and quality of responses from 4 leading LLMs (ChatGPT-4o, Google Gemini 2.0, DeepSeek, and OpenEvidence) to 20 standardized perioperative questions about craniosynostosis surgery. Quality was assessed using modified DISCERN criteria, and readability was measured using the SMOG Index. Statistical analysis included 1-way ANOVA and Tukey HSD post hoc tests. Results: By SMOG score, OpenEvidence produced responses at the highest reading level (17.54), indicating a graduate-level comprehension requirement. ChatGPT (14.45), DeepSeek (14.40), and Google Gemini (15.17) generated information at an undergraduate reading level. By information quality, measured by modified DISCERN scores, Google Gemini achieved the highest score (42.95 out of a maximum of 45; P < 0.001), significantly outperforming ChatGPT (36.25), DeepSeek (37.55), and OpenEvidence (36.75). Gemini’s responses were rated highest in clarity, citation use, and support for shared decision-making. Conclusions: LLMs vary significantly in readability and information quality. Google Gemini offered the most trustworthy content, whereas DeepSeek was the most accessible. No single model excelled across all dimensions, suggesting that clinicians should guide caregivers toward the LLMs best suited to their literacy level. Generative AI holds promise for augmenting patient education in craniosynostosis care, but it should be used alongside clinician input to ensure clarity, accuracy, and relevance.
Mangal et al. studied this question.
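For readers unfamiliar with the SMOG Index used as the readability measure, the following is a minimal sketch of the standard SMOG grade formula, 1.0430 × sqrt(polysyllables × 30 / sentences) + 3.1291. It is illustrative only: the abstract does not say which tool the authors used, and the helper names `count_syllables` and `smog_grade`, the vowel-run syllable heuristic, and the simple sentence splitter are all assumptions made for this example.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as runs of consecutive vowels.

    This is a rough heuristic assumed for illustration, not a
    dictionary-based syllable count.
    """
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def smog_grade(text: str) -> float:
    """Return the SMOG grade for `text` using the standard formula.

    Polysyllables are words with three or more (estimated) syllables.
    The count is normalized to 30 sentences, as in the formula.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        raise ValueError("text must contain at least one sentence")
    words = re.findall(r"[A-Za-z]+", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291


if __name__ == "__main__":
    sample = "Craniosynostosis surgery is complicated."
    print(round(smog_grade(sample), 2))
```

The formula was originally calibrated on samples of 30 sentences, so grades for very short passages such as the sample above should be read as rough estimates. A score near 14 to 15 corresponds to the undergraduate level reported for ChatGPT, DeepSeek, and Google Gemini, and a score above 17 to the graduate level reported for OpenEvidence.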