Large language models (LLMs), such as ChatGPT and Google Gemini, are increasingly used in medicine for purposes ranging from medical education to research. Given the accessibility of consumer-facing models, patients may turn to them for answers to their medical questions. This study compared outputs from ChatGPT and Google Gemini in response to common post-operative questions from patients after kyphoplasty. Thirteen common post-operative questions were compiled and posed to ChatGPT and Gemini. Five clinicians assessed the clinical accuracy and appropriateness of the responses using a 5-point Likert scale. Reviewers were blinded to model identity. Readability was evaluated by three raters using the Flesch-Kincaid grade level and a 3-point Likert scale. Matched-pair t-tests were used to compare responses from ChatGPT and Google Gemini, with statistical significance defined as a p-value < 0.05. ChatGPT responses were more accurate (p < 0.001) and more appropriate (p < 0.01) than Gemini's. ChatGPT's average Flesch-Kincaid grade level was 12.2, compared to 13.0 for Gemini (p = 0.05). On the 3-point Likert scale for readability, ChatGPT scored an average of 1.56/2, while Gemini scored 1.85/2 (p = 0.01). ChatGPT outperformed Gemini in clinical accuracy and appropriateness of responses. The results for readability were mixed: the Flesch-Kincaid grade level indicated that ChatGPT generated responses at a lower grade level, while the Likert scale showed that Gemini’s responses were easier to read. Although ChatGPT demonstrated better clinical accuracy and appropriateness, the use of LLMs should not replace clinician-delivered post-operative counseling.
Zhu et al. (Sun,) studied this question.
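The two quantitative tools named in the abstract, the Flesch-Kincaid grade level and the matched-pair t-test, can be illustrated with a minimal sketch. This is not the authors' analysis code. The function names are hypothetical, the word, sentence, and syllable counts are assumed to come from a separate text-processing step, and the ratings are invented purely for illustration and are not data from the study.

```python
import math
import statistics


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level from raw counts.

    The counts are assumed to be supplied by the caller; counting
    syllables reliably is a separate problem not handled here.
    """
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for a matched-pair t-test.

    Each element of `a` is paired with the element of `b` at the same
    index, e.g. two models' scores on the same question.
    """
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_d == 0:
        raise ValueError("all paired differences are identical")
    return mean_d / (sd_d / math.sqrt(n)), n - 1


# Hypothetical Likert ratings for five questions (illustration only).
model_a = [5, 4, 5, 4, 5]
model_b = [4, 3, 4, 4, 3]
t_stat, dof = paired_t_statistic(model_a, model_b)
print(f"t = {t_stat:.3f}, df = {dof}")

# Hypothetical counts for one response: 100 words, 5 sentences, 150 syllables.
print(f"grade level = {flesch_kincaid_grade(100, 5, 150):.2f}")
```

The sketch stops at the t statistic because the Python standard library has no Student's t distribution. In practice, a two-sided p-value for the same paired design can be obtained with `scipy.stats.ttest_rel(model_a, model_b)`.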