What question did this study set out to answer?

This study aims to evaluate the performance of ChatGPT and Gemini in answering patient questions after gynecologic surgery.

February 26, 2026

Evaluation of ChatGPT and Gemini in Answering Patient Questions after Gynecologic Surgery

Key Points

This study aims to evaluate the performance of ChatGPT and Gemini in answering patient questions after gynecologic surgery.
The study was cross-sectional and compared two large language models, GPT-4 and Gemini.
Common post-operative patient questions were developed from expert opinion and from anonymous posts on Reddit (r/endometriosis).
Responses were evaluated for accuracy, relevance, helpfulness, and readability by four surgeons and three nurses.
Gemini scored higher than GPT-4 in accuracy (4.23 vs 4.03, p=0.015) and helpfulness (4.37 vs 4.21, p=0.025).
Both models produced relevant responses, with no significant difference in relevance scores (p=0.2).
85% of GPT-4 responses and 87% of Gemini responses were consistent across all questions.

Abstract

Objective: To explore the performance of the ChatGPT version 4.0 (GPT-4) and Gemini Advanced (Gemini) large language models (LLMs) in addressing common patient questions after gynecologic surgery with regard to accuracy, relevance, helpfulness, and readability.
Methods: In this cross-sectional study, two LLMs were prompted to generate answers to post-operative patient questions after gynecologic surgery. The questions were developed to simulate common patient questions, based on expert opinion and compiled from anonymous posters on Reddit (r/endometriosis). Six topics were emphasized: endometriosis, vaginal bleeding, bowel/bladder function, incision care, resumption of activities, and sexual function. Questions were submitted systematically, with the memory reset after each query. Responses were blinded and independently assessed for accuracy and relevance on a 5-point Likert scale by four board-certified gynecologic surgeons with fellowship training in gynecologic surgery. Readability was calculated with the Flesch-Kincaid grade level. Responses were also evaluated by three clinic nurses.
Results: 41 questions were posed to GPT-4 and Gemini three times each. The responses were independently evaluated by four surgeons and three nurses, leading to a total of 1,968 evaluations for accuracy, relevance, helpfulness to the average patient, and readability. Surgeons and nurses graded Gemini responses as more accurate (4.23 vs 4.03, p=0.015) and more helpful (4.37 vs 4.21, p=0.025) than GPT-4 responses. Responses from both models were similarly found to be relevant or very relevant (4.45 vs 4.36, p=0.2). Most responses by GPT-4 (85%) and Gemini (87%) were consistent across all questions. The average reading levels for GPT-4 and Gemini responses were 11th and 10th grade, respectively, both above the recommended 6th grade reading level for patient information.
Conclusion: GPT-4 and Gemini provided overall accurate, relevant, and helpful responses to common post-operative patient questions for gynecologic surgery. Gemini outperformed GPT-4 in accuracy and helpfulness and had more readable responses.
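
For readers unfamiliar with the readability metric named in the abstract, the Flesch-Kincaid grade level is computed as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below is illustrative only: the abstract does not say which tool the authors used, and the function names, the sentence splitting, and the vowel-group syllable counter here are simplifying assumptions rather than the study's method.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (an assumption, not a dictionary lookup):
    count runs of consecutive vowels, then drop a likely silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59"""
    # Naive sentence split on terminal punctuation; abbreviations are not handled.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


if __name__ == "__main__":
    # Hypothetical sample text, not taken from the study's model responses.
    sample = "Keep the incision clean and dry. Call your surgeon if the bleeding increases."
    print(round(flesch_kincaid_grade(sample), 2))
```

Because the syllable counter is a heuristic, scores from this sketch can differ slightly from those of established readability tools; longer sentences and longer words both push the grade level up.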

Authors

Petra Voigt

Rhea Sharma

Angela Chaudhari

Journals

Applied Clinical Informatics

Institutions

Northwestern University
