Large language models (LLMs) have shown promising performance for American Society of Anesthesiologists Physical Status (ASA-PS) classification, but prior work suggests reduced agreement in high-risk patients. We evaluated LLM reliability for ASA-PS classification in cardiovascular surgery. Thirty-two anonymized cases were rated by two residents, two board-certified cardiovascular anesthesiologists, and four LLM modes (ChatGPT: GPT-5.2 Instant and GPT-5.2 Thinking; Gemini: Gemini 3 Fast and Gemini 3 High Thinking); all LLM assessments were zero-shot. Overall agreement across evaluators was moderate (intraclass correlation coefficient [ICC] 0.49–0.52); agreement between each LLM and specialists was good (ICC 0.61–0.65). Exact-match agreement with a five-specialist consensus was 42.2% for residents versus 59.4–75.0% for LLMs; classifications outside the range of ratings assigned by individual specialists were rare (0–3.1%). In cardiovascular surgery, contemporary LLMs showed good concordance with cardiovascular anesthesiologists and exceeded resident agreement with expert consensus, supporting prospective multicenter validation of LLMs as adjuncts for ASA-PS assessment and training.
Iwabu et al. (Wed,) studied this question.
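
As an illustration of the two percentage metrics reported above (exact-match agreement with a consensus rating, and the share of classifications falling outside the range assigned by individual specialists), the following is a minimal Python sketch. The function names and the four-case toy data are hypothetical and are not taken from the study; the study's actual computation may differ, for example in how ratings from several raters were pooled.

```python
from typing import Sequence


def exact_match_rate(ratings: Sequence[int], consensus: Sequence[int]) -> float:
    """Percentage of cases where a rater's ASA-PS class equals the consensus class."""
    if len(ratings) != len(consensus) or not consensus:
        raise ValueError("ratings and consensus must be non-empty and equal in length")
    matches = sum(r == c for r, c in zip(ratings, consensus))
    return 100.0 * matches / len(consensus)


def out_of_range_rate(
    ratings: Sequence[int], specialist_ratings: Sequence[Sequence[int]]
) -> float:
    """Percentage of cases where a rating lies outside the min-max span
    of the classes assigned by the individual specialists for that case."""
    if len(ratings) != len(specialist_ratings) or not ratings:
        raise ValueError("inputs must be non-empty and equal in length")
    outside = sum(
        not (min(spec) <= r <= max(spec))
        for r, spec in zip(ratings, specialist_ratings)
    )
    return 100.0 * outside / len(ratings)


# Hypothetical toy data (NOT study data): four cases, ASA-PS classes as integers.
consensus = [3, 4, 3, 4]                        # assumed consensus class per case
specialists = [[3, 3], [3, 4], [3, 4], [4, 4]]  # assumed individual specialist ratings
llm = [3, 4, 4, 3]                              # assumed ratings from one LLM mode

print(exact_match_rate(llm, consensus))     # 50.0
print(out_of_range_rate(llm, specialists))  # 25.0
```

In the toy data, the third case shows a rating that misses the consensus yet still falls within the specialists' range, which is why the two metrics are reported separately. With 32 cases, a single out-of-range classification corresponds to 1/32 = 3.125%, or about 3.1%.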