What does this research mean for the field?

ChatGPT provides more clinically accurate and appropriate responses than Google Gemini for post-operative patient questions after kyphoplasty. Novelty: confirmatory. Consensus alignment: supports consensus.

What question did this study set out to answer?

To compare the outputs of ChatGPT and Google Gemini in answering common post-operative questions from kyphoplasty patients regarding clinical accuracy and readability.

March 2, 2026 | Open Access

Can Large Language Models Reliably Educate Patients After Kyphoplasty? A Clinician-Rated Comparative Study of ChatGPT and Gemini

Key Points

Compared the outputs of ChatGPT and Google Gemini in answering common post-operative questions from kyphoplasty patients, focusing on clinical accuracy and readability.
Developed thirteen post-operative questions for patients after kyphoplasty.
Evaluated responses from ChatGPT and Google Gemini with five blinded clinicians using a 5-point Likert scale.
Assessed readability with the Flesch-Kincaid grade level and a 3-point Likert scale.
Utilized matched-pair t-tests to analyze response comparisons between the models.
ChatGPT demonstrated significantly higher clinical accuracy (p<0.001) and appropriateness (p<0.01) than Gemini.
The average Flesch-Kincaid grade level for ChatGPT was 12.2, compared to 13.0 for Gemini (p=0.05).
On the readability Likert scale, ChatGPT scored 1.56/2 while Gemini scored 1.85/2 (p=0.01).
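
The matched-pair comparison described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' analysis code: the function name `paired_t_statistic` is hypothetical, and the ratings are made-up placeholder values rather than data from the study. It returns the t statistic and degrees of freedom; converting these to a p-value would need a t-distribution table or a statistics package.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for a matched-pair t-test.

    a and b are equal-length sequences of scores, where a[i] and b[i]
    rate the same item (e.g. the same question answered by two models).
    """
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Placeholder Likert ratings for illustration only (NOT study data).
model_a = [5, 4, 4, 5, 3]
model_b = [4, 3, 4, 3, 2]
t, df = paired_t_statistic(model_a, model_b)
print(f"t = {t:.3f}, df = {df}")
```

A paired test suits this design because both models answered the same thirteen questions, so each question serves as its own control.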

Abstract

Large language models (LLMs), such as ChatGPT and Google Gemini, are increasingly used in medicine for purposes ranging from medical education to research. Given the accessibility of consumer-facing models, patients may turn to them for answers to their medical questions. This study compared outputs from ChatGPT and Google Gemini in response to common post-operative questions from patients after kyphoplasty. Thirteen common post-operative questions were compiled and posed to ChatGPT and Gemini. Five clinicians assessed the clinical accuracy and appropriateness of the responses using a 5-point Likert scale. Reviewers were blinded to model identity. Readability was evaluated by three raters using the Flesch-Kincaid grade level and a 3-point Likert scale. Matched-pair t-tests were used to compare responses from ChatGPT and Google Gemini, with statistical significance defined as a p-value < 0.05. ChatGPT responses were more accurate (p < 0.001) and more appropriate (p < 0.01) than Gemini responses. ChatGPT's average Flesch-Kincaid grade level was 12.2, compared to 13.0 for Gemini (p = 0.05). On the 3-point Likert scale for readability, ChatGPT scored an average of 1.56/2, while Gemini scored 1.85/2 (p = 0.01). ChatGPT outperformed Gemini in terms of clinical accuracy and the appropriateness of responses. The results for readability were mixed: the Flesch-Kincaid system indicated that ChatGPT generated responses at a lower grade level, while the Likert scale showed that Gemini's responses were easier to read. While ChatGPT demonstrated better clinical accuracy and appropriateness, the use of LLMs should not replace clinician-delivered post-operative counseling.
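
For readers unfamiliar with the Flesch-Kincaid grade level used above, the standard formula is 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59, where a higher value means harder text. The sketch below applies it to counts supplied by the caller. The function name `flesch_kincaid_grade` is hypothetical and the counts are illustrative, not taken from the study. How words, sentences, and syllables are counted varies between tools, and the abstract does not say which tool the authors used.

```python
def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level from raw counts (standard formula).

    Higher values correspond to text that needs more years of
    schooling to read comfortably.
    """
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Illustrative counts only (NOT study data): 100 words in 5 sentences,
# with 150 syllables in total.
grade = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
print(f"Flesch-Kincaid grade level: {grade:.2f}")
```

By this measure, both models' reported averages (12.2 and 13.0) correspond to roughly a late high school to early college reading level.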
