
April 17, 2026 · Open Access

Evaluation of the reliability of large language models for ASA-PS classification in cardiovascular surgery: a pilot study

Key Points

This study aims to assess the reliability of large language models for ASA-PS classification in cardiovascular surgery.
Two residents and two board-certified cardiovascular anesthesiologists rated 32 anonymized cases.
Four large language model (LLM) modes were evaluated: GPT-5.2 Instant and GPT-5.2 Thinking (ChatGPT), and Gemini 3 Fast and Gemini 3 High Thinking (Gemini).
All LLM assessments were zero-shot.
Agreement was quantified with intraclass correlation coefficients (ICC).
Overall agreement among evaluators was moderate (ICC 0.49–0.52).
Agreement between each LLM and the specialists was good (ICC 0.61–0.65).
Exact-match rates against the expert consensus were 42.2% for residents and 59.4–75.0% for LLMs.
Classifications outside the range of ratings assigned by individual specialists were rare (0–3.1%).
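
The two rate metrics in the key points can be illustrated with a minimal sketch. It assumes ASA-PS classes are coded as integers, that the exact-match rate is the share of cases where a rater's class equals the consensus class, and that "outside the range" means below the lowest or above the highest class given by any individual specialist. The function names and the toy ratings are hypothetical and are not the study's data.

```python
def exact_match_rate(ratings, consensus):
    """Share of cases where the rater's class equals the consensus class."""
    matches = sum(1 for r, c in zip(ratings, consensus) if r == c)
    return matches / len(consensus)


def out_of_range_rate(ratings, specialist_ratings):
    """Share of cases where the rating falls outside the [min, max]
    span of the classes assigned by the individual specialists."""
    outside = sum(
        1
        for r, spec in zip(ratings, specialist_ratings)
        if not (min(spec) <= r <= max(spec))
    )
    return outside / len(ratings)


# Hypothetical toy data: four cases, ASA-PS classes coded as integers.
llm = [3, 4, 3, 5]
consensus = [3, 4, 4, 4]
specialists = [(3, 3), (3, 4), (3, 4), (4, 4)]

print(f"exact match:  {exact_match_rate(llm, consensus):.1%}")
print(f"out of range: {out_of_range_rate(llm, specialists):.1%}")
```

If each LLM rate is taken over the 32 cases, 59.4% and 75.0% would correspond to 19 and 24 exact matches, and 3.1% to a single case. The source does not report these counts, so this is arithmetic only, not a reported result.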

Abstract

Large language models (LLMs) have shown promising performance for American Society of Anesthesiologists Physical Status (ASA-PS) classification, but prior work suggests reduced agreement in high-risk patients. We evaluated LLM reliability for ASA-PS classification in cardiovascular surgery. Thirty-two anonymized cases were rated by two residents, two board-certified cardiovascular anesthesiologists, and four LLM modes (ChatGPT: GPT-5.2 Instant and GPT-5.2 Thinking; Gemini: Gemini 3 Fast and Gemini 3 High Thinking); all LLM assessments were zero-shot. Overall agreement across evaluators was moderate (intraclass correlation coefficient [ICC] 0.49–0.52), whereas agreement between each LLM and the specialists was good (ICC 0.61–0.65). The exact-match rate against a five-specialist consensus was 42.2% for residents versus 59.4–75.0% for LLMs; classifications outside the range of ratings assigned by individual specialists were rare (0–3.1%). In cardiovascular surgery, contemporary LLMs showed good concordance with cardiovascular anesthesiologists and exceeded resident agreement with expert consensus. These findings support prospective multicenter validation of LLMs as adjuncts for ASA-PS assessment and training.
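
The abstract reports ICC values without stating which ICC form was used. As a rough illustration of how such a coefficient is obtained from a cases-by-raters table, the sketch below computes ICC(2,1) in the Shrout and Fleiss notation (two-way random effects, absolute agreement, single rater). The choice of ICC(2,1), the function name `icc2_1`, and the toy ratings are assumptions made for illustration; they are not the study's method or data.

```python
def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: list of rows, one row per case, one score per rater.
    Assumes a complete table (every rater scored every case).
    """
    n = len(ratings)      # number of cases
    k = len(ratings[0])   # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)

    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares from a two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between cases
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical toy table: four cases rated by three raters (ASA-PS as integers).
toy = [
    [3, 3, 4],
    [4, 4, 4],
    [3, 4, 3],
    [5, 4, 4],
]
print(f"ICC(2,1) = {icc2_1(toy):.2f}")
```

Because ICC(2,1) penalizes systematic offsets between raters as well as random disagreement, a rater who is consistently one class higher than the others lowers the coefficient even when the rank order of cases is identical.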
