What is the clinical evidence from this study?

Study design: Observational. Population: Wolff-Parkinson-White syndrome with manifest accessory pathways (n=49). Intervention: ChatGPT 5 Thinking vs. Gemini 2.5 Pro. Primary outcome: Repeated-run diagnostic accuracy against the electrophysiology-confirmed pathway location (ChatGPT 5 Thinking 19.0%, 95% CI 11.5–26.6; Gemini 2.5 Pro 12.2%, 95% CI 6.8–17.7).

What does this research mean for the field?

General-purpose multimodal large language models demonstrate poor accuracy and reproducibility in localizing manifest accessory pathways from 12-lead ECGs, rendering them unsuitable for clinical use. Novelty: novel finding. Consensus alignment: neutral.

What question did this study set out to answer?

To evaluate the diagnostic accuracy of general-purpose AI systems for localising accessory pathways in patients with Wolff–Parkinson–White syndrome.

May 26, 2026 | Open Access

Benchmarking General Purpose Artificial Intelligence for Accessory Pathway Localisation on 12-Lead Electrocardiograms: A Proof-of-Concept Study

Key Result

General-purpose AI models ChatGPT 5 Thinking and Gemini 2.5 Pro demonstrated poor repeated-run diagnostic accuracy (19.0% and 12.2%, respectively) for localising accessory pathways on 12-lead ECGs.

Key Points

To evaluate the diagnostic accuracy of general-purpose AI systems for localising accessory pathways in patients with Wolff–Parkinson–White syndrome.
Retrospective, single-centre study including 49 patients with accessory pathways confirmed by electrophysiology study/ablation.
Anonymised 12-lead ECGs were analysed by ChatGPT 5 Thinking and Gemini 2.5 Pro in three independent context-reset runs per model.
Primary outcome was repeated-run diagnostic accuracy against the confirmed pathway location.
ChatGPT 5 Thinking correctly localised the pathway in 28/147 outputs (19.0% accuracy, ECG-clustered 95% CI 11.5–26.6).
Gemini 2.5 Pro correctly localised the pathway in 18/147 outputs (12.2% accuracy, 95% CI 6.8–17.7).
Both models performed below the no-information majority-class baseline of 36.7%, with frequent no-consensus outputs.
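
The arithmetic behind these figures can be checked with a short sketch. The counts 28, 18, 147 and 49 come from the summary; the value 18 for the majority class is an assumption, chosen only because 18/49 rounds to the reported 36.7% baseline (the summary does not state the class counts).

```python
# Counts reported in the summary: 49 ECGs, each analysed in 3 runs per model.
N_ECGS = 49
RUNS_PER_ECG = 3
TOTAL_OUTPUTS = N_ECGS * RUNS_PER_ECG  # 147 outputs per model

# Correct outputs per model, as reported.
correct_outputs = {"ChatGPT 5 Thinking": 28, "Gemini 2.5 Pro": 18}

# ASSUMPTION: the most common pathway category contains 18 of 49 ECGs.
# This is inferred from the 36.7% baseline, not stated in the source.
ASSUMED_MAJORITY_CLASS_COUNT = 18
baseline = ASSUMED_MAJORITY_CLASS_COUNT / N_ECGS


def accuracy_pct(n_correct: int, n_total: int) -> float:
    """Return accuracy as a percentage rounded to one decimal place."""
    return round(100 * n_correct / n_total, 1)


results = {
    model: accuracy_pct(n_correct, TOTAL_OUTPUTS)
    for model, n_correct in correct_outputs.items()
}

for model, acc in results.items():
    print(f"{model}: {acc}% (baseline {100 * baseline:.1f}%)")
```

A classifier that always answered with the most common category would, under that assumption, be right for every ECG in that category, which is why the baseline is a useful floor for comparison.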

Study Design

Type

Observational (n=49)

Single-centre

Structured PICO

Do general-purpose multimodal large language models (ChatGPT 5 Thinking and Gemini 2.5 Pro) accurately localize manifest accessory pathways on 12-lead ECGs compared to electrophysiology study?

Population

49 consecutive patients with manifest accessory pathways confirmed during electrophysiology study/ablation in a retrospective, single-centre study.

Intervention

Analysis of pre-procedural 12-lead ECGs by general-purpose multimodal large language models (ChatGPT 5 Thinking and Gemini 2.5 Pro) using predefined EASY-WPW anatomical categories, tested in three independent context-reset runs.

Comparator

Electrophysiology-confirmed reference standard.

Outcome

Repeated-run diagnostic accuracy against the electrophysiology-confirmed pathway location.

General-purpose multimodal large language models demonstrate poor accuracy and reproducibility for localizing manifest accessory pathways on 12-lead ECGs, indicating they are not currently suitable for clinical use in this context.

Main Result

Repeated-run accuracy: 19.0% (ChatGPT 5 Thinking) vs 12.2% (Gemini 2.5 Pro)
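
Because each ECG contributes three outputs, the 147 outputs per model are not independent, and a confidence interval should treat the ECG as the unit of analysis. The study reports an "ECG-clustered" interval without giving the exact method, so the following is only one plausible sketch: a normal-approximation interval built from per-ECG accuracy. The function name and the toy data are hypothetical and are not the study's data.

```python
import math


def clustered_accuracy_ci(correct_by_ecg, z=1.96):
    """Accuracy with a cluster-level normal-approximation CI.

    correct_by_ecg: list of per-ECG lists of 0/1 flags, one flag per run.
    Assumes every ECG has the same number of runs, so the mean of the
    per-ECG accuracies equals the pooled accuracy.
    """
    cluster_means = [sum(runs) / len(runs) for runs in correct_by_ecg]
    k = len(cluster_means)
    p = sum(cluster_means) / k
    # Sample variance of the per-ECG accuracies (between-cluster variance).
    variance = sum((m - p) ** 2 for m in cluster_means) / (k - 1)
    se = math.sqrt(variance / k)
    # Clamp to [0, 1], since a proportion cannot fall outside that range.
    return p, max(0.0, p - z * se), min(1.0, p + z * se)


# Hypothetical toy data: 4 ECGs x 3 runs (1 = correct, 0 = incorrect).
toy = [[1, 1, 1], [0, 0, 0], [1, 0, 0], [0, 0, 0]]
p, lower, upper = clustered_accuracy_ci(toy)
print(f"accuracy={p:.3f}, 95% CI {lower:.3f} to {upper:.3f}")
```

When runs on the same ECG tend to agree with each other, this interval is wider than one that treats all outputs as independent, which is the reason for clustering in the first place.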

Limitations

Class imbalance
Small subgroup denominators

Abstract

Background/Objectives: Accurate localisation of manifest accessory pathways from the 12-lead electrocardiogram remains clinically relevant in Wolff–Parkinson–White syndrome, particularly for pre-procedural planning. Although purpose-built artificial intelligence models have shown promise in ECG interpretation, the reliability of general-purpose multimodal large language models for accessory pathway localisation is unknown. We evaluated two contemporary general-purpose AI systems against an electrophysiology-confirmed reference standard and assessed reproducibility across repeated analyses.

Methods: In this retrospective, single-centre proof-of-concept diagnostic accuracy study, 49 consecutive patients with manifest accessory pathways confirmed during electrophysiology study/ablation were included. Anonymised pre-procedural 12-lead ECGs were compiled into a single PDF and analysed by ChatGPT 5 Thinking and Gemini 2.5 Pro using predefined EASY-WPW anatomical categories. Each model was tested in three independent context-reset runs. The primary outcome was repeated-run diagnostic accuracy against the electrophysiology-confirmed pathway location, with confidence intervals calculated using an ECG-clustered approach. Secondary outcomes included majority-vote accuracy, pathway-specific descriptive accuracy, exact output consistency, no-consensus outputs, and “unable to identify” responses.

Results: Each model generated 147 repeated outputs from the same 49 ECGs. ChatGPT 5 Thinking correctly localised 28/147 outputs, corresponding to a repeated-run accuracy of 19.0% (ECG-clustered 95% CI 11.5–26.6), while Gemini 2.5 Pro correctly localised 18/147 outputs, corresponding to 12.2% accuracy (95% CI 6.8–17.7). Both models performed below the no-information majority-class baseline of 36.7%. Majority-vote accuracy was 7/49 for ChatGPT 5 Thinking and 2/49 for Gemini 2.5 Pro. Exact output consistency across all three runs was observed in 2/49 ECGs for ChatGPT 5 Thinking and 0/49 ECGs for Gemini 2.5 Pro. Complete no-consensus outputs occurred in 30/49 and 26/49 ECGs, respectively. “Unable to identify” responses were infrequent: 8/147 outputs for ChatGPT 5 Thinking and 2/147 outputs for Gemini 2.5 Pro. Pathway-specific estimates were descriptive only because of class imbalance and small subgroup denominators.

Conclusions: General-purpose multimodal large language models demonstrated poor repeated-run accuracy, very low reproducibility, frequent no-consensus outputs, and limited abstention when localising manifest accessory pathways from 12-lead ECGs. These findings do not support their current clinical use for accessory pathway localisation. Future progress is more likely to come from purpose-built, signal-native, or rigorously validated multimodal cardiac AI systems.
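
The secondary outcomes named in the abstract (majority-vote accuracy, exact output consistency, and no-consensus outputs) all follow from comparing the three run outputs for each ECG. The sketch below shows one way to derive them. The function names and the category labels "A", "B" and "C" are hypothetical placeholders, not the EASY-WPW categories, and treating an "unable to identify" response as just another label is an assumption rather than the study's documented rule.

```python
from collections import Counter


def summarise_runs(outputs):
    """Classify the repeated outputs for one ECG.

    Returns (label, status), where status is "exact" when every run agrees,
    "majority" when more than half of the runs agree, and "no-consensus"
    otherwise (in which case label is None).
    """
    label, count = Counter(outputs).most_common(1)[0]
    if count == len(outputs):
        return label, "exact"
    if count > len(outputs) / 2:
        return label, "majority"
    return None, "no-consensus"


def majority_vote_accuracy(runs_by_ecg, reference):
    """Fraction of ECGs whose majority label matches the reference label."""
    hits = 0
    for outputs, truth in zip(runs_by_ecg, reference):
        label, _ = summarise_runs(outputs)
        hits += int(label == truth)
    return hits / len(reference)


# Hypothetical toy data: 4 ECGs, 3 runs each, with placeholder labels.
runs_by_ecg = [
    ["A", "A", "A"],  # all runs agree
    ["A", "B", "A"],  # two of three agree
    ["A", "B", "C"],  # every run differs
    ["B", "B", "C"],  # two of three agree
]
reference = ["A", "B", "C", "B"]

statuses = [summarise_runs(outputs)[1] for outputs in runs_by_ecg]
print(Counter(statuses))
print(majority_vote_accuracy(runs_by_ecg, reference))
```

An ECG with no consensus can never count as a majority-vote hit under this rule, so a high no-consensus rate caps the achievable majority-vote accuracy.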