What is the clinical evidence from this study?

Study design: Observational. Population: Clinical vignettes of adults with suspected pulmonary embolism (n=140). Intervention: Microsoft Copilot vs. Wells score. Primary outcome: Inclusion of pulmonary embolism in the top 10 differential diagnoses (OR 3.41, 95% CI 1.04-11.17).

July 13, 2025 | Open Access

Performance of Microsoft Copilot in the Diagnostic Process of Pulmonary Embolism

Key Result

Microsoft Copilot identified pulmonary embolism in the top 10 differential diagnoses in 94.3% of cases and achieved a higher AUC for risk stratification than the Wells score (0.713 vs 0.583).

Study Design

Type

Observational (n=140)

Structured PICO

Does Microsoft Copilot improve the diagnostic identification and risk assessment of pulmonary embolism compared to the Wells score in clinical vignettes?

Population

140 clinical vignettes of adult patients (≥18 years) with suspected pulmonary embolism who underwent CTPA (70 with confirmed PE, 70 without PE), derived from case reports published within the last 10 years. Mean age 54 years; 54.3% female.

Intervention

Microsoft Copilot (GPT-4 integration, 'precise' mode) analyzing clinical vignettes to generate a top 10 differential diagnosis list and predict the risk of pulmonary embolism.

Comparator

Wells score calculated independently by two investigators based on the review of the same clinical vignettes.

Outcome

Ability of Microsoft Copilot to accurately identify pulmonary embolism based on clinical data by listing it within the top 10 differential diagnosis list.

Microsoft Copilot demonstrated high accuracy in including pulmonary embolism in differential diagnoses and outperformed the Wells score in risk stratification using clinical vignettes.

Main Result

Effect estimate: OR 3.41 (95% CI 1.04-11.17)

Absolute Event Rate: 94.3% vs 82.9%
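
As a consistency check, the odds ratio and confidence interval can be recomputed from the two event rates. The sketch below rests on two assumptions that this page does not state: that each rate has a denominator of 70 (66/70 = 94.3% and 58/70 = 82.9%), and that the interval is a Wald interval on the log odds ratio. The function name `odds_ratio_wald` is hypothetical, and the study itself may have used a different method.

```python
import math

def odds_ratio_wald(events_a, total_a, events_b, total_b, z=1.96):
    """Odds ratio of group A vs group B with a Wald CI on the log scale."""
    a, b = events_a, total_a - events_a   # events / non-events in group A
    c, d = events_b, total_b - events_b   # events / non-events in group B
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) for a 2x2 table
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * se), math.exp(log_or + z * se)

# Assumed counts inferred from the percentages: 66/70 and 58/70
or_, lo, hi = odds_ratio_wald(66, 70, 58, 70)
print(f"OR {or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

Under these assumed counts the output agrees with the reported estimate, which suggests the two rates were each measured on 70 vignettes.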

Limitations

Publication bias in the included case reports, which may disproportionately feature high-risk or ambiguous cases
Written clinical vignettes may not fully represent real-world presentations and omit general appearance
Case matching was performed based only on age and sex, not controlling for comorbidities
Investigators adjudicating the Wells score were not blinded to the full text of the vignette
Probabilistic nature of large language models introduces variability in outputs and limits reproducibility

Abstract

INTRODUCTION: Patients with pulmonary embolism (PE) often present with non-specific signs and symptoms mimicking other conditions and complicating diagnosis. In this study we aimed to evaluate the performance of an artificial-intelligence tool, Microsoft Copilot, in the diagnostic process of PE, using clinical data including demographics, complaints, and vital signs.

METHODS: We conducted this study using 140 clinical vignettes, including 70 patients with and 70 patients without PE. The vignettes were derived from published case reports within the last 10 years. We used Copilot for its free GPT-4 integration to analyze clinical data and answer two questions after each vignette. We compared Copilot's ability to identify PE within the top 10 differential diagnoses, and its ability to predict the risk of PE when compared to the use of the Wells score by two independent investigators.

RESULTS: Copilot correctly included PE in the differential diagnosis in 94.3% of cases by listing it within the top 10 conditions. Risk assessment by Copilot yielded significantly higher levels in patients with PE (P.05). Copilot demonstrated better discriminatory power than the Wells score in risk assessment of PE (area under the curve 0.713 vs 0.583), with statistical significance (P<0.001 vs P=.091). Sensitivity, specificity, positive predictive value, and negative predictive value for discriminating between the combination of low- and intermediate- vs high-risk categories were 34%, 97.1%, 92.3%, and 59.6%, respectively.

CONCLUSION: This study explores the potential of Copilot as a tool in clinical decision-making, demonstrating a high rate of correctly identifying PE and improved performance over the Wells score. However, further validation in larger populations and real-world settings is crucial to fully realize its potential.
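
The four accuracy figures in the abstract all derive from one 2x2 table. The sketch below uses counts inferred from the reported percentages and the 70/70 split: 24 true positives, 46 false negatives, 2 false positives, and 68 true negatives. These counts are an assumption and are not taken from the paper. "Positive" here means Copilot assigned the high-risk category, and `diagnostic_metrics` is a hypothetical helper name.

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Standard accuracy measures from a 2x2 confusion table."""
    return {
        "sensitivity": tp / (tp + fn),  # PE cases rated high-risk
        "specificity": tn / (tn + fp),  # non-PE cases not rated high-risk
        "ppv": tp / (tp + fp),          # high-risk ratings that were PE
        "npv": tn / (tn + fn),          # non-high-risk ratings without PE
    }

# Assumed counts, back-calculated from the abstract's percentages
metrics = diagnostic_metrics(tp=24, fn=46, fp=2, tn=68)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With these counts the helper gives 34.3%, 97.1%, 92.3%, and 59.6%, in line with the abstract. A high-risk rating was rarely wrong, but about two thirds of PE cases were not rated high-risk.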

Authors

Banu Arslan

Sağlık Bilimleri Üniversitesi

Mehmet Necmeddin Sutaşır

Ministry of Health

Ertuğrul Altınbilek

University of Health Science

Journals

Western Journal of Emergency Medicine

Institutions

Ministry of Health

Şişli Etfal Eğitim ve Araştırma Hastanesi
