What question did this study set out to answer?

This study aims to assess the diagnostic accuracy of GPT-5 and GPT-4o for STEMI from 12-lead ECGs, compared with cardiologists and emergency medicine specialists (EMSs).

March 9, 2026

Accuracy of GPT-5 and GPT-4o in diagnosing STEMI from 12-Lead ECGs: A comparative study with cardiologists and emergency physicians

Key Points

This study aims to assess the diagnostic accuracy of GPT-5 and GPT-4o for STEMI from 12-lead ECGs, compared with cardiologists and emergency medicine specialists (EMSs).
Case-control design involving 234 patients (117 STEMI cases, 117 controls)

Structured PICO

Do GPT-5 and GPT-4o accurately diagnose STEMI from 12-lead ECGs compared to cardiologists and emergency physicians in patients presenting with chest pain?

Population

234 patients presenting with chest pain, comprising 117 angiography-confirmed STEMI cases and 117 age- and sex-matched controls without STEMI

Intervention

Diagnostic interpretation of anonymized 12-lead ECGs by large language models (GPT-5 and GPT-4o)

Comparator

Diagnostic interpretation by 3 emergency medicine specialists (EMSs) and 3 cardiologists

Outcome

Diagnostic accuracy for STEMI (surrogate)

Current large language models like GPT-5 and GPT-4o are not reliable as independent diagnostic tools for STEMI due to high false-positive rates and inferior overall accuracy compared to clinicians.

Abstract

BACKGROUND ST-segment elevation myocardial infarction (STEMI) requires rapid, accurate electrocardiogram (ECG) interpretation. The diagnostic effectiveness of new large language models (LLMs) like GPT-5 and GPT-4o in this high-risk area remains a critical knowledge gap.

OBJECTIVES We aimed to evaluate the diagnostic performance of GPT-5 and GPT-4o in diagnosing STEMI from 12-lead ECGs, comparing them against emergency medicine specialists (EMSs) and cardiologists.

METHODS In a case-control study of 234 patients, we included 117 angiography-confirmed STEMI cases and 117 age- and sex-matched controls presenting with chest pain but without STEMI. Anonymized ECG images were presented to 3 EMSs, 3 cardiologists, GPT-5, and GPT-4o with a dichotomous (yes/no) question: "Is there a STEMI?" AI models were queried three times on different days to measure response consistency. Performance was compared using accuracy, sensitivity, specificity, predictive values, and likelihood ratios.

RESULTS Cardiologists (Accuracy: 89.6%) and EMSs (Accuracy: 87.8%) significantly outperformed both GPT-5 (Accuracy: 69.9%) and GPT-4o (Accuracy: 55.9%) (p<0.001). While GPT-5's sensitivity (85.5%) was statistically comparable to clinicians (86.9%-88.6%), it exhibited a critically high false-positive (overcall) rate (45.6%) compared to cardiologists (7.7%) and EMSs (13.1%). GPT-4o's sensitivity was significantly lower (76.9%). GPT-5 showed substantial response consistency (Fleiss' Kappa=0.76), while GPT-4o's was fair (Fleiss' Kappa=0.26).

CONCLUSION GPT-5 approached clinician-level sensitivity but was unreliable due to an extremely high false-positive rate. GPT-4o's performance was poor. Current LLMs are not reliable as independent diagnostic tools for STEMI.
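
The performance measures named in the abstract are arithmetically linked, and a short calculation can show how a high false-positive rate pulls down overall accuracy even when sensitivity is good. The sketch below is not the authors' analysis code. It assumes the reported false-positive rate equals 1 - specificity and that cases and controls carry equal weight (117 vs 117). The function name `diagnostic_metrics` and its parameters are illustrative, and only the GPT-5 sensitivity (85.5%) and false-positive rate (45.6%) are taken from the abstract.

```python
def diagnostic_metrics(sensitivity, false_positive_rate, prevalence=0.5):
    """Derive common diagnostic-accuracy measures from two rates.

    Assumptions (not stated in the source): false_positive_rate is
    1 - specificity, and `prevalence` is the share of STEMI cases in
    the sample (0.5 for a 1:1 case-control design).
    """
    specificity = 1.0 - false_positive_rate

    # Expected fractions of the whole sample in each cell of the 2x2 table.
    true_pos = prevalence * sensitivity
    false_neg = prevalence * (1.0 - sensitivity)
    false_pos = (1.0 - prevalence) * false_positive_rate
    true_neg = (1.0 - prevalence) * specificity

    return {
        "specificity": specificity,
        "accuracy": true_pos + true_neg,
        # Predictive values depend on prevalence; in a case-control sample
        # they reflect the design ratio, not real-world prevalence.
        "ppv": true_pos / (true_pos + false_pos),
        "npv": true_neg / (true_neg + false_neg),
        "lr_positive": sensitivity / false_positive_rate,
        "lr_negative": (1.0 - sensitivity) / specificity,
    }


# GPT-5 rates as reported in the abstract.
gpt5 = diagnostic_metrics(sensitivity=0.855, false_positive_rate=0.456)

for name, value in gpt5.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the derived accuracy comes out at roughly 0.70, in line with the 69.9% reported for GPT-5, and the positive likelihood ratio is below 2, which means a "yes" from the model shifts the probability of STEMI only modestly.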
