What type of study is this?

This is a validation study.

What question did this study set out to answer?

The study aims to assess the quality of ChatGPT-4.0 as a digital tool for patient education in anesthesia.

March 15, 2026 | Open Access

Evaluating ChatGPT-4 as a digital patient education tool in anesthesia: A multi-rater quality assessment

Key Points

The study aims to assess the quality of ChatGPT-4.0 as a digital tool for patient education in anesthesia.
The authors identified 22 common anesthesia-related patient questions through online search.
Each question was submitted once to ChatGPT-4.0 without follow-up prompts.
Five anesthesiology specialists independently rated each response on a validated 4-point scale (1 = excellent; 4 = unsatisfactory).
Overall, 61.8% of the 110 ratings were excellent, and no response was rated unsatisfactory.
Mean scores per question ranged from 1.0 to 2.4 on that scale, where lower is better, indicating generally positive evaluations.
Inter-rater reliability was poor to fair (ICC = 0.25).

Abstract

Background: Large language models such as ChatGPT are increasingly used by patients seeking perioperative information, yet their reliability for anesthesia-related patient education remains insufficiently evaluated. This study assessed the quality of ChatGPT-4.0 responses to frequently asked anesthesia questions using a multi-rater evaluation framework.

Methods: Twenty-two common anesthesia-related patient questions were identified through online search. Each question was submitted once to ChatGPT-4.0 (GPT-4-turbo; chat.openai.com) without follow-up prompts. Five anesthesiology and reanimation specialists, each with more than 20 years of experience, independently evaluated each response using a validated 4-point Likert-type scale (1 = excellent; 4 = unsatisfactory). Inter-rater reliability was calculated using a two-way random-effects model, ICC(2,1).

Results: A total of 110 ratings were collected. Among these, 61.8% were classified as excellent, 32.7% as satisfactory requiring minimal clarification, and 5.5% as satisfactory requiring moderate clarification. No responses were rated as unsatisfactory. Mean scores for individual questions ranged from 1.0 to 2.4. Reviewer-wise averages ranged from 1.27 to 1.73, indicating generally positive evaluations with modest variability in scoring strictness. The overall inter-rater reliability was poor to fair (ICC = 0.25).

Conclusions: ChatGPT-4.0 provided high-quality responses to frequently asked patient questions about anesthesia and may serve as a supportive digital health tool for patient education. However, limited agreement among evaluators highlights the need for expert oversight and contextual refinement when integrating large language models into clinical communication pathways.
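
For readers unfamiliar with the reliability statistic named in the Methods, the following is a minimal sketch of how a two-way random-effects, single-rater ICC(2,1) can be computed from a complete items-by-raters table using the standard Shrout and Fleiss mean-squares formula. It is an illustration only, not the authors' analysis code: the function name `icc2_1` and the list-of-rows input layout are assumptions made for this example, and the abstract does not say which software the authors used.

```python
from typing import Sequence


def icc2_1(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a complete table with one row per rated item (for example,
    one ChatGPT response) and one column per rater. Hypothetical helper,
    written for illustration only.
    """
    n = len(ratings)       # number of rated items
    k = len(ratings[0])    # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with >= 2 items and >= 2 raters")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA decomposition of the total sum of squares.
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)   # between items
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)   # between raters
    ss_error = ss_total - ss_rows - ss_cols                       # residual

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Shrout & Fleiss ICC(2,1); raises ZeroDivisionError if all ratings are identical.
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
```

For this study's design, `ratings` would be a 22 x 5 table (22 responses, 5 raters). The individual ratings are not reported in the abstract, so this sketch does not attempt to reproduce the published ICC of 0.25. Because nearly all ratings fall into the two best categories, between-response variance is small, which tends to pull this statistic down even when raters broadly agree.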
