Does AI-assisted ECG interpretation (GE 12SL algorithm) agree with expert cardiologist annotations for emergency-critical diagnoses, and what are the cognitive risks of algorithmic errors?
AI ECG interpretation algorithms frequently disagree with expert cardiologists on emergency-critical diagnoses and may actively mislead physicians by confidently substituting incorrect labels for missed lethal conditions. As a result, the AI-physician combination is projected to perform worse than the physician alone.
Background: Artificial intelligence algorithms for electrocardiogram (ECG) interpretation are now standard in emergency departments worldwide, yet the assumption that AI and physician errors are complementary, and therefore self-correcting, has never been systematically tested against the cognitive realities of emergency medicine practice.
Objective: To develop a cognitive risk taxonomy that maps specific AI error patterns through specific physician bias pathways to specific clinical risk predictions for emergency-critical ECG diagnoses.
Methods: We analysed 21,799 dual-labelled ECGs from the PTB-XL+ dataset, comparing expert cardiologist annotations against GE 12SL algorithm output across 54 emergency-critical SNOMED CT concepts spanning acute myocardial infarction, life-threatening arrhythmias, conduction blocks, and ischaemia. A five-stage disagreement analysis framework quantified error direction, magnitude, confidence profiles, compound co-occurrence patterns, and diagnostic substitution profiles. Each disagreement signature was mapped to cognitive bias pathways derived from dual-process theory and scored on a composite of frequency, clinical severity, and bias amplification potential.
Results: Of 106,401 non-trivial comparisons, 93.7% were discordant (6.3% agreement), with 81% of the dataset carrying at least one emergency-critical disagreement (mean 5.7 per affected ECG). Eight named disagreement signatures were identified, organised into a two-tier, four-class taxonomy: Lethal Diagnosis Miss (anteroseptal MI blind spot, SVT/VT inversion, LBBB–STEMI mask), Mechanism Blindness (bradycardia inflation, fascicular desert), Signal Corruption (ischaemia noise floor, old MI avalanche), and Self-Undermining AI (QT alarm fatigue). Seven were classified as CRITICAL risk. The AI’s binary confidence architecture delivered 81% of overcalls at maximum confidence with no hedging, creating near-maximum anchoring potential for every error. Diagnostic substitution profiling revealed that AI misses are not silent omissions but active reframings: when the AI misses anteroseptal MI, it labels “Old MI” on 60.6% of those ECGs; when it misses ventricular tachycardia, it labels “SVT” on 45.8%.
Conclusions: The Compound Risk Hypothesis was validated across all eight danger zones: in every case, the AI–physician combination was projected to perform worse than the physician alone through three mechanisms: Direct Suppression, Capability Destruction, and Environmental Contamination. A seven-component safeguard framework targeting class-specific bias pathways was developed, designed to be net-negative on alert burden. The taxonomy framework is system-agnostic and designed for reapplication to any AI system with dual-label validation data.
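The error-direction and substitution-profile stages described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the label strings (`ASMI`, `OLD_MI`, `VT`, `SVT`, `LBBB`), and the four toy ECGs are hypothetical placeholders, and the sketch assumes a "non-trivial" comparison is one in which at least one of the two labellers asserts the concept.

```python
from collections import Counter

def classify(expert: bool, ai: bool) -> str:
    """Label one (ECG, concept) comparison by error direction."""
    if expert and ai:
        return "agree"
    if expert:
        return "miss"      # expert asserts the concept, AI does not
    if ai:
        return "overcall"  # AI asserts the concept, expert does not
    return "trivial"       # neither asserts it (assumed to be excluded)

def disagreement_summary(expert_labels, ai_labels, concepts):
    """Count error directions and the agreement rate over non-trivial comparisons."""
    counts = Counter()
    for ecg_id, expert_set in expert_labels.items():
        ai_set = ai_labels.get(ecg_id, set())
        for concept in concepts:
            counts[classify(concept in expert_set, concept in ai_set)] += 1
    non_trivial = counts["agree"] + counts["miss"] + counts["overcall"]
    agreement = counts["agree"] / non_trivial if non_trivial else float("nan")
    return counts, agreement

def substitution_profile(expert_labels, ai_labels, missed_concept):
    """For ECGs where the AI misses `missed_concept`, return the fraction
    of those ECGs on which the AI asserted each other label."""
    substitutes = Counter()
    n_missed = 0
    for ecg_id, expert_set in expert_labels.items():
        ai_set = ai_labels.get(ecg_id, set())
        if missed_concept in expert_set and missed_concept not in ai_set:
            n_missed += 1
            substitutes.update(ai_set)
    if n_missed == 0:
        return {}
    return {label: n / n_missed for label, n in substitutes.items()}

# Hypothetical toy data: label strings are placeholders, not SNOMED CT codes.
concepts = ["ASMI", "OLD_MI", "VT", "SVT", "LBBB"]
expert = {"e1": {"ASMI"}, "e2": {"ASMI"}, "e3": {"VT"}, "e4": {"ASMI", "LBBB"}}
ai = {"e1": {"OLD_MI"}, "e2": {"ASMI"}, "e3": {"SVT"}, "e4": {"OLD_MI", "LBBB"}}

counts, agreement = disagreement_summary(expert, ai, concepts)
profile = substitution_profile(expert, ai, "ASMI")
print(dict(counts), agreement, profile)
```

On real dual-labelled data the same counting would be run per concept, so that a miss on one diagnosis can be paired with the label the algorithm offered in its place.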
Dabaliz et al. (Sun,) studied this question.