Background/Objectives: Large language models (LLMs) are increasingly used for patient education, yet comparative evidence regarding their accuracy, safety, and ethical performance remains limited, particularly in high-risk fields such as anesthesiology. This study aimed to conduct a multidimensional comparison of five contemporary LLMs in answering common patient questions in anesthesiology.

Methods: In this cross-sectional, comparative in silico study, 30 standardized patient questions covering general anesthesia, spinal/epidural anesthesia, and peripheral nerve blocks were submitted to ChatGPT, Gemini, Microsoft Copilot, DeepSeek, and Grok. Responses were independently evaluated under full blinding by five senior anesthesiology professors using a 5-point Likert scale across six domains: accuracy, safety, completeness, understandability, ethics, and overall assessment. Inter-rater reliability was assessed using intraclass correlation coefficients (ICC). Performance differences were analyzed using linear mixed-effects models accounting for question- and evaluator-level variability, with results reported as estimated marginal means.

Results: Inter-rater agreement was good to excellent across all domains (ICC > 0.75). Significant model-related differences were observed for overall assessment, accuracy, safety, completeness, and ethics (all p < 0.001), whereas understandability did not differ significantly between models. ChatGPT achieved the highest overall performance, while Gemini demonstrated superior accuracy. Model performance varied across anesthesiology subspecialties, with significant model × topic interactions identified in multiple domains (p < 0.01).

Conclusions: LLMs may serve as supportive tools for patient education in anesthesiology; however, their performance varies substantially across models and clinical contexts. Differences in accuracy, safety, and ethical performance highlight the need for cautious, context-aware integration of LLMs into clinical practice rather than their use as substitutes for anesthesiologists’ clinical judgment.
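As an illustration of the inter-rater reliability step named in the Methods, the following is a minimal sketch of one common ICC variant. The abstract does not state which ICC form the authors used; the two-way random-effects, absolute-agreement, single-rater form, ICC(2,1), is assumed here. The function name `icc2_1` and the example scores are hypothetical and are not data from the study.

```python
from statistics import mean


def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one row per rated response,
    each row holding one score per rater (no missing values).
    This ICC form is an assumption; the study does not specify it.
    """
    n = len(ratings)      # number of rated responses
    k = len(ratings[0])   # number of raters

    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols

    # Mean squares for responses, raters, and residual error
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


if __name__ == "__main__":
    # Hypothetical 5-point Likert scores: 4 responses rated by 5 evaluators
    example = [
        [5, 4, 5, 5, 4],
        [3, 3, 2, 3, 3],
        [4, 4, 4, 5, 4],
        [2, 1, 2, 2, 2],
    ]
    print(f"ICC(2,1) = {icc2_1(example):.3f}")
```

Under the conventional interpretation, values above 0.75 correspond to the "good to excellent" agreement reported in the Results.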
Şahin et al. studied this question.