What question did this study set out to answer?

This study compares the performance of large language models in providing patient education in anesthesiology.

March 15, 2026 · Open Access

Artificial Intelligence as a Support Tool for Preoperative Patient Education in Anesthesiology: A Comparative Evaluation of Five Large Language Models

Key Points

Cross-sectional, comparative study design
Input of 30 standardized patient questions regarding anesthesiology
Five language models evaluated: ChatGPT, Gemini, Microsoft Copilot, DeepSeek, Grok
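Responses assessed by five senior anesthesiology professors, blinded to model identity, using a 5-point Likert scale across six domains (accuracy, safety, completeness, understandability, ethics, overall assessment)
Data analyzed using linear mixed-effects models to account for variability across questions and evaluators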
Good to excellent inter-rater agreement across assessment domains (ICC > 0.75)
Significant differences in overall assessment, safety, accuracy, completeness, and ethics were found (p < 0.001)
ChatGPT achieved the highest overall performance, while Gemini demonstrated superior accuracy
Model performance differed by anesthesiology subspecialty, with significant model × topic interactions in multiple domains (p < 0.01)

Abstract

Background/Objectives: Large language models (LLMs) are increasingly used for patient education, yet comparative evidence regarding their accuracy, safety, and ethical performance remains limited, particularly in high-risk fields such as anesthesiology. This study aimed to conduct a multidimensional comparison of five contemporary LLMs in answering common patient questions in anesthesiology.

Methods: In this cross-sectional, comparative in silico study, 30 standardized patient questions covering general anesthesia, spinal/epidural anesthesia, and peripheral nerve blocks were submitted to ChatGPT, Gemini, Microsoft Copilot, DeepSeek, and Grok. Responses were independently evaluated under full blinding by five senior anesthesiology professors using a 5-point Likert scale across six domains: accuracy, safety, completeness, understandability, ethics, and overall assessment. Inter-rater reliability was assessed using intraclass correlation coefficients (ICC). Performance differences were analyzed using linear mixed-effects models accounting for question- and evaluator-level variability, with results reported as estimated marginal means.

Results: Inter-rater agreement was good to excellent across all domains (ICC > 0.75). Significant model-related differences were observed for overall assessment, accuracy, safety, completeness, and ethics (all p < 0.001), whereas understandability did not differ significantly between models. ChatGPT achieved the highest overall performance, while Gemini demonstrated superior accuracy. Model performance varied across anesthesiology subspecialties, with significant model × topic interactions identified in multiple domains (p < 0.01).

Conclusions: LLMs may serve as supportive tools for patient education in anesthesiology; however, their performance varies substantially across models and clinical contexts. Differences in accuracy, safety, and ethical performance highlight the need for cautious, context-aware integration of LLMs into clinical practice rather than their use as substitutes for anesthesiologists’ clinical judgment.
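The inter-rater reliability step in the Methods can be illustrated with a small sketch. The abstract does not state which ICC form the authors used, so the example below assumes the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1). The function name `icc2_1` and the ratings matrix are hypothetical and are not taken from the study.

```python
def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one per rated item (e.g. one LLM response),
    each row holding one score per rater. Assumed form; the paper's exact
    ICC variant is not stated in the abstract.
    """
    n = len(ratings)        # number of rated items
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)

    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]

    # Two-way ANOVA sums of squares (no replication)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)              # between-item mean square
    ms_cols = ss_cols / (k - 1)              # between-rater mean square
    ms_err = ss_err / ((n - 1) * (k - 1))    # residual mean square

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Illustrative ratings only (6 items x 4 raters), NOT data from the study.
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc2_1(example), 2))  # 0.29
```

In this toy matrix the raters differ systematically in how high they score, which pulls the absolute-agreement ICC well below the 0.75 threshold reported in the Results. A consistency-type ICC on the same data would be higher, which is one reason the chosen form matters when comparing reliability figures.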

Şahin et al. studied this question.

synapsesocial.com/papers/69b6068883145bc643d1c742
https://doi.org/10.3390/jcm15062197