General-purpose AI models ChatGPT 5 Thinking and Gemini 2.5 Pro demonstrated poor repeated-run diagnostic accuracy (19.0% and 12.2%, respectively) for localising accessory pathways on 12-lead ECGs.
Observational (n=49)
No
Do general-purpose multimodal large language models (ChatGPT 5 Thinking and Gemini 2.5 Pro) accurately localize manifest accessory pathways on 12-lead ECGs compared to electrophysiology study?
General-purpose multimodal large language models demonstrate poor accuracy and reproducibility for localizing manifest accessory pathways on 12-lead ECGs, indicating they are not currently suitable for clinical use in this context.
Absolute Event Rate: 19.0% vs 12.2% (repeated-run accuracy, ChatGPT 5 Thinking vs Gemini 2.5 Pro)
Background/Objectives: Accurate localisation of manifest accessory pathways from the 12-lead electrocardiogram remains clinically relevant in Wolff–Parkinson–White syndrome, particularly for pre-procedural planning. Although purpose-built artificial intelligence models have shown promise in ECG interpretation, the reliability of general-purpose multimodal large language models for accessory pathway localisation is unknown. We evaluated two contemporary general-purpose AI systems against an electrophysiology-confirmed reference standard and assessed reproducibility across repeated analyses.
Methods: In this retrospective, single-centre proof-of-concept diagnostic accuracy study, 49 consecutive patients with manifest accessory pathways confirmed during electrophysiology study/ablation were included. Anonymised pre-procedural 12-lead ECGs were compiled into a single PDF and analysed by ChatGPT 5 Thinking and Gemini 2.5 Pro using predefined EASY-WPW anatomical categories. Each model was tested in three independent context-reset runs. The primary outcome was repeated-run diagnostic accuracy against the electrophysiology-confirmed pathway location, with confidence intervals calculated using an ECG-clustered approach. Secondary outcomes included majority-vote accuracy, pathway-specific descriptive accuracy, exact output consistency, no-consensus outputs, and “unable to identify” responses.
Results: Each model generated 147 repeated outputs from the same 49 ECGs. ChatGPT 5 Thinking correctly localised 28/147 outputs, corresponding to a repeated-run accuracy of 19.0% (ECG-clustered 95% CI 11.5–26.6), while Gemini 2.5 Pro correctly localised 18/147 outputs, corresponding to 12.2% accuracy (95% CI 6.8–17.7). Both models performed below the no-information majority-class baseline of 36.7%. Majority-vote accuracy was 7/49 for ChatGPT 5 Thinking and 2/49 for Gemini 2.5 Pro. Exact output consistency across all three runs was observed in 2/49 ECGs for ChatGPT 5 Thinking and 0/49 ECGs for Gemini 2.5 Pro. Complete no-consensus outputs occurred in 30/49 and 26/49 ECGs, respectively. “Unable to identify” responses were infrequent: 8/147 outputs for ChatGPT 5 Thinking and 2/147 outputs for Gemini 2.5 Pro. Pathway-specific estimates were descriptive only because of class imbalance and small subgroup denominators.
Conclusions: General-purpose multimodal large language models demonstrated poor repeated-run accuracy, very low reproducibility, frequent no-consensus outputs, and limited abstention when localising manifest accessory pathways from 12-lead ECGs. These findings do not support their current clinical use for accessory pathway localisation. Future progress is more likely to come from purpose-built, signal-native, or rigorously validated multimodal cardiac AI systems.
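The repeated-run metrics reported in the abstract can be illustrated with a short Python sketch. The abstract does not specify how the ECG-clustered interval was computed. The code below assumes one common choice: a Wald interval whose standard error is taken across per-ECG accuracies, which is what a cluster-robust estimator reduces to when every ECG has the same number of runs. The function names, the toy per-ECG counts, and the category labels "A", "B", "C" are hypothetical and are not the study data. Only the pooled counts 28/147 and 18/147 come from the abstract.

```python
import math
from collections import Counter
from statistics import mean, stdev


def clustered_accuracy_ci(correct_per_ecg, runs_per_ecg=3, z=1.96):
    """Repeated-run accuracy with a Wald interval clustered on ECG.

    Assumption (not confirmed by the abstract): every ECG has the same
    number of runs, so the cluster-robust standard error equals the
    standard error of the per-ECG accuracies.
    """
    per_ecg = [c / runs_per_ecg for c in correct_per_ecg]
    acc = mean(per_ecg)
    se = stdev(per_ecg) / math.sqrt(len(per_ecg))
    # Clip the interval to the valid probability range [0, 1].
    return acc, max(0.0, acc - z * se), min(1.0, acc + z * se)


def summarise_runs(labels):
    """Classify the repeated outputs for one ECG.

    Returns (label, status), where status is "exact" (all runs agree),
    "majority" (more than half agree) or "no-consensus" (otherwise).
    """
    top_label, top_count = Counter(labels).most_common(1)[0]
    if top_count == len(labels):
        return top_label, "exact"
    if top_count > len(labels) / 2:
        return top_label, "majority"
    return None, "no-consensus"


# Pooled point estimates taken directly from the counts in the abstract.
print(round(100 * 28 / 147, 1))  # ChatGPT 5 Thinking -> 19.0
print(round(100 * 18 / 147, 1))  # Gemini 2.5 Pro -> 12.2

# Hypothetical toy data: correct runs (0 to 3) for each of 10 ECGs.
toy_correct = [0, 0, 1, 0, 3, 0, 1, 0, 0, 1]
acc, lo, hi = clustered_accuracy_ci(toy_correct)
print(f"toy accuracy {acc:.3f}, 95% CI {lo:.3f} to {hi:.3f}")

# Hypothetical run outputs for three ECGs.
for runs in (["A", "A", "A"], ["A", "B", "A"], ["A", "B", "C"]):
    print(runs, summarise_runs(runs))
```

The pooled point estimate depends only on the total counts, so it can be checked from the abstract alone. The clustered interval depends on how correct outputs are distributed across ECGs, which the abstract does not report, so the published limits cannot be reproduced from the summary figures.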
Abdelrazik et al. (Sun,) conducted an observational study in patients with Wolff-Parkinson-White syndrome and manifest accessory pathways (n=49). ChatGPT 5 Thinking and Gemini 2.5 Pro were each evaluated on repeated-run diagnostic accuracy against the electrophysiology-confirmed pathway location. Both models demonstrated poor repeated-run accuracy for localising accessory pathways on 12-lead ECGs: 19.0% (95% CI 11.5-26.6) for ChatGPT 5 Thinking and 12.2% (95% CI 6.8-17.7) for Gemini 2.5 Pro.