Do GPT-5 and GPT-4o accurately diagnose STEMI from 12-lead ECGs compared to cardiologists and emergency physicians in patients presenting with chest pain?
Current large language models like GPT-5 and GPT-4o are not reliable as independent diagnostic tools for STEMI due to high false-positive rates and inferior overall accuracy compared to clinicians.
BACKGROUND: ST-segment elevation myocardial infarction (STEMI) requires rapid, accurate electrocardiogram (ECG) interpretation. The diagnostic effectiveness of new large language models (LLMs) like GPT-5 and GPT-4o in this high-risk area remains a critical knowledge gap. OBJECTIVES: We aimed to evaluate the diagnostic performance of GPT-5 and GPT-4o in diagnosing STEMI from 12-lead ECGs, comparing them against emergency medicine specialists (EMSs) and cardiologists. METHODS: In a case-control study of 234 patients, we included 117 angiography-confirmed STEMI cases and 117 age- and sex-matched controls presenting with chest pain but without STEMI. Anonymized ECG images were presented to 3 EMSs, 3 cardiologists, GPT-5, and GPT-4o with a dichotomous (yes/no) question: "Is there a STEMI?" Each AI model was queried three times on different days to measure response consistency. Performance was compared using accuracy, sensitivity, specificity, predictive values, and likelihood ratios. RESULTS: Cardiologists (accuracy: 89.6%) and EMSs (accuracy: 87.8%) significantly outperformed both GPT-5 (accuracy: 69.9%) and GPT-4o (accuracy: 55.9%) (p<0.001). While GPT-5's sensitivity (85.5%) was statistically comparable to that of clinicians (86.9%-88.6%), it exhibited a critically high false-positive (overcall) rate (45.6%) compared to cardiologists (7.7%) and EMSs (13.1%). GPT-4o's sensitivity was significantly lower (76.9%). GPT-5 showed substantial response consistency (Fleiss' kappa=0.76), while GPT-4o's was only fair (Fleiss' kappa=0.26). CONCLUSION: GPT-5 approached clinician-level sensitivity but was unreliable due to an extremely high false-positive rate. GPT-4o's performance was poor. Current LLMs are not reliable as independent diagnostic tools for STEMI.
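The relationship between the reported sensitivity, false-positive rate, and accuracy can be checked with a little arithmetic. The sketch below is illustrative and is not the authors' analysis code. It assumes the false-positive rate equals 1 minus specificity, and that the 1:1 case-control design means half of all ratings come from STEMI cases. The function name `diagnostic_metrics` and its parameters are hypothetical. Only the inputs (85.5% sensitivity and 45.6% false-positive rate for GPT-5) come from the abstract. The predictive values and likelihood ratios it prints are derived here, not reported there.

```python
def diagnostic_metrics(sensitivity, false_positive_rate, prevalence=0.5):
    """Derive standard 2x2 diagnostic metrics from rates.

    Assumes false_positive_rate = 1 - specificity and that `prevalence`
    is the share of STEMI cases in the sample (0.5 for a 1:1 design).
    """
    specificity = 1.0 - false_positive_rate
    neg_share = 1.0 - prevalence

    # Accuracy is the prevalence-weighted mean of sensitivity and specificity.
    accuracy = prevalence * sensitivity + neg_share * specificity

    # Predictive values depend on the sample's case mix, so they describe
    # this balanced sample rather than a real emergency department.
    true_pos = sensitivity * prevalence
    false_pos = false_positive_rate * neg_share
    true_neg = specificity * neg_share
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)

    # Likelihood ratios do not depend on prevalence.
    lr_pos = (sensitivity / false_positive_rate
              if false_positive_rate > 0 else float("inf"))
    lr_neg = ((1.0 - sensitivity) / specificity
              if specificity > 0 else float("inf"))

    return {
        "specificity": specificity,
        "accuracy": accuracy,
        "ppv": ppv,
        "npv": npv,
        "lr_pos": lr_pos,
        "lr_neg": lr_neg,
    }


# Inputs taken from the abstract: GPT-5 sensitivity and false-positive rate.
gpt5 = diagnostic_metrics(sensitivity=0.855, false_positive_rate=0.456)
for name, value in gpt5.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the derived accuracy for GPT-5 comes out near 70%, in line with the reported 69.9%. The positive likelihood ratio is below 2, which is one way to see why a "yes" from the model carries little diagnostic weight despite its clinician-level sensitivity.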
Kokulu et al. studied this question.