What question did this study set out to answer?

This research aims to evaluate the readability and quality of responses from leading AI models regarding craniosynostosis surgery care.

January 25, 2026

Artificial Intelligence in Craniosynostosis Surgery: A Comparison of Large Language Models in Answering Perioperative Care Questions

Key Points

Evaluated responses from 4 large language models (LLMs) on 20 perioperative questions about craniosynostosis surgery.
Assessed information quality using modified DISCERN criteria and readability using the SMOG Index.
Conducted statistical analysis including 1-way ANOVA and Tukey HSD to compare model performance.
OpenEvidence had the highest SMOG score (17.54), meaning its responses were the hardest to read and required graduate-level comprehension.
Google Gemini achieved the best quality score (42.95 out of 45) and was rated highest for clarity, citation use, and support for shared decision-making.
ChatGPT, DeepSeek, and Google Gemini provided information at an undergraduate reading level; ChatGPT, DeepSeek, and OpenEvidence received lower quality scores than Gemini.

Abstract

Background: Craniosynostosis surgery poses complex challenges for caregivers, who often turn to large language models (LLMs) for preoperative and postoperative information. Although LLMs offer accessible guidance, concerns persist about the quality and readability of the information they provide, especially in specialized surgical contexts.

Methods: This study evaluates the readability and quality of responses from 4 leading LLMs (ChatGPT-4o, Google Gemini 2.0, DeepSeek, and OpenEvidence) to 20 standardized perioperative questions about craniosynostosis surgery. Quality was assessed using modified DISCERN criteria, and readability was measured using the SMOG Index. Statistical analysis included 1-way ANOVA and Tukey HSD.

Results: By SMOG score, OpenEvidence produced responses at the highest reading level (17.54), indicating a graduate-level comprehension requirement. ChatGPT (14.45), DeepSeek (14.40), and Google Gemini (15.17) generated information at an undergraduate reading level. By information quality, measured by modified DISCERN scores, Google Gemini achieved the highest score (42.95 out of a maximum of 45, P < 0.001), significantly outperforming ChatGPT (36.25), DeepSeek (37.55), and OpenEvidence (36.75). Gemini’s responses were rated highest in clarity, citation use, and support for shared decision-making.

Conclusions: LLMs vary significantly in readability and information quality. Google Gemini offered the most trustworthy content, whereas DeepSeek was the most accessible. No single model excelled across all dimensions, suggesting that clinicians should guide caregivers toward the LLMs best suited to their literacy level. Generative AI holds promise for augmenting patient education in craniosynostosis care, but it should be used alongside clinician input to ensure clarity, accuracy, and relevance.
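To give a sense of what the SMOG scores above measure, the following is a minimal sketch of the standard SMOG grade formula, 1.0430 × sqrt(polysyllables × 30 / sentences) + 3.1291, where a polysyllable is a word of three or more syllables. The study does not state which tool it used to compute SMOG, so this is illustrative only. The function names `count_syllables` and `smog_grade` are hypothetical, and the vowel-group syllable counter is a crude assumption that real readability tools replace with more careful rules.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (assumption): count runs of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def smog_grade(text: str) -> float:
    """Return the SMOG grade for text using the standard published formula."""
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        raise ValueError("text must contain at least one sentence")
    words = re.findall(r"[A-Za-z]+", text)
    # Polysyllables are words with three or more syllables.
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    # Normalize the polysyllable count to a 30-sentence sample.
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291


if __name__ == "__main__":
    sample = (
        "Your child may need monitoring in the intensive care unit. "
        "Swelling around the eyes is common after the operation."
    )
    print(round(smog_grade(sample), 2))
```

The result approximates the years of education needed to understand a text, which is why a score near 17.5 is read as graduate level and scores around 14 to 15 as undergraduate level.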
