What question did this study set out to answer?

This study estimates the accuracy of deep learning models for diagnosing and grading chronic obstructive pulmonary disease (COPD).

January 23, 2026 | Open Access

Accuracy of Deep Learning in Diagnosing Chronic Obstructive Pulmonary Disease: Systematic Review and Meta-Analysis

Key Points

The Cochrane Library, Embase, Web of Science, and PubMed were systematically searched for studies published up to November 1, 2025.
Risk of bias was assessed with the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool.
Subgroup analyses were conducted by validation set generation method and imaging data source.
Binary outcomes were synthesized with a bivariate mixed effects model; multiclass outcomes were pooled with random-effects models.
In total, 56 studies comprising 886,753 participants were included.
For binary classification of COPD, pooled sensitivity was 0.87 and specificity was 0.88, with an AUC of 0.93.
CT-based models showed a sensitivity of 0.86 and a specificity of 0.87; respiratory sound models showed higher values, with a sensitivity of 0.91 and a specificity of 0.96.
Accuracy for multiclass GOLD staging was limited, with pooled per-stage estimates ranging from 61.7% to 84.2%.
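
To make the random-effects pooling of per-stage multiclass accuracy more concrete, here is a minimal Python sketch. It assumes a DerSimonian-Laird estimator applied to logit-transformed proportions; the source does not say which estimator or transformation the authors used, and the study counts below are hypothetical, not taken from the paper.

```python
import math

def pool_random_effects(counts):
    """DerSimonian-Laird pooling of proportions on the logit scale.

    counts: list of (correct, total) pairs, one per study.
    Returns (pooled proportion, tau^2 on the logit scale).
    """
    effects, variances = [], []
    for correct, total in counts:
        if not 0 < correct < total:
            raise ValueError("each study needs 0 < correct < total (no continuity correction here)")
        effects.append(math.log(correct / (total - correct)))
        variances.append(1 / correct + 1 / (total - correct))

    # Fixed-effect (inverse-variance) weights and Cochran's Q
    w = [1 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))

    # Method-of-moments between-study variance, truncated at zero
    k = len(effects)
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c) if k > 1 and c > 0 else 0.0

    # Random-effects weights add tau^2 to each within-study variance
    w_re = [1 / (v + tau2) for v in variances]
    pooled_logit = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    return 1 / (1 + math.exp(-pooled_logit)), tau2

# Hypothetical per-study counts (correctly classified, total) for one GOLD stage
hypothetical_studies = [(45, 60), (30, 55), (70, 90), (22, 40)]
pooled, tau2 = pool_random_effects(hypothetical_studies)
print(f"pooled accuracy = {pooled:.3f}, tau^2 = {tau2:.3f}")
```

When studies disagree more than sampling error alone would explain, tau^2 grows and the weights become more equal across studies, which widens the interval around the pooled estimate.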

Abstract

Background: Chronic obstructive pulmonary disease (COPD) is a common chronic lung disease. Deep learning (DL), a data-driven machine learning approach, has gained attention in clinical practice, particularly for diagnosing COPD and grading its severity. However, systematic evidence of its diagnostic and grading accuracy remains limited, posing challenges for developing intelligent diagnostic tools.

Objective: This study aimed to systematically estimate the accuracy of DL models for diagnosing and grading COPD, providing up-to-date evidence for the design and clinical implementation of intelligent detection tools.

Methods: The Cochrane Library, Embase, Web of Science, and PubMed were systematically searched for studies on DL for diagnosing COPD and grading its severity published up to November 1, 2025. Risk of bias was assessed using the Quality Assessment of Diagnostic Accuracy Studies-2 tool. Subgroup analyses by the validation set generation method and imaging data source were conducted, and meta-analyses were performed on the validation sets. For binary outcomes, diagnostic 2×2 tables were synthesized using a bivariate mixed effects model; for multiclass outcomes, accuracy estimates were pooled using random-effects models.

Results: In total, 56 studies comprising 886,753 participants were included. Inputs were computed tomography (CT) imaging (n=30), breath sounds or audio (n=12), conventional chest X-ray (n=2), X-ray film (n=2), and other modalities (n=10), including pulmonary function indices or curves or physiological waveforms, electrocardiograms, volumetric capnography maps, radiogenetic data, and clinical scores. For binary classification of COPD, DL models yielded a pooled sensitivity of 0.87 (95% CI 0.85‐0.90), specificity of 0.88 (95% CI 0.84‐0.92), diagnostic odds ratio (DOR) of 52 (95% CI 30‐88), and an area under the summary receiver operating characteristic curve (AUC) of 0.93. For CT-based DL models, pooled sensitivity was 0.86 (95% CI 0.84‐0.89), specificity was 0.87 (95% CI 0.82‐0.90), DOR was 42 (95% CI 26‐68), and AUC was 0.92. For respiratory sound–based models, sensitivity was 0.91 (95% CI 0.84‐0.95), specificity was 0.96 (95% CI 0.91‐0.98), DOR was 237 (95% CI 78‐723), and AUC was 0.98. In multiclass classification, the DL models showed limited accuracy in discriminating Global Initiative for Chronic Obstructive Lung Disease (GOLD) stages: GOLD stage 0 (84.2%, 95% CI 60.5%‐98.2%), stage 1 (61.7%, 95% CI 40.7%‐80.8%), stage 2 (67.9%, 95% CI 37.6%‐91.7%), stage 3 (70.8%, 95% CI 16.3%‐100%), and stage 4 (70.8%, 95% CI 16.3%‐100%).

Conclusions: This study is the first systematic synthesis of DL applications for COPD detection and GOLD staging. DL models based on CT images and breath sounds show high accuracy for binary COPD detection, whereas multiclass GOLD grading remains concerning. These findings support the development and updating of artificial intelligence-assisted COPD screening tools; however, substantial heterogeneity and limited external validation warrant cautious interpretation. Future reproducible multicenter studies with standardized reporting are needed.
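
The pooled sensitivity, specificity, and DOR reported above are linked by a standard identity: the DOR equals the positive likelihood ratio divided by the negative likelihood ratio. The Python sketch below recomputes these quantities from the rounded pooled values in the abstract. It is an illustrative cross-check only, not the authors' bivariate model, and the function and variable names are hypothetical.

```python
def diagnostic_summary(sensitivity: float, specificity: float) -> dict:
    """Derive likelihood ratios and the diagnostic odds ratio from one sens/spec pair."""
    if not (0 < sensitivity < 1 and 0 < specificity < 1):
        raise ValueError("sensitivity and specificity must lie strictly between 0 and 1")
    lr_pos = sensitivity / (1 - specificity)   # how much a positive result raises the odds
    lr_neg = (1 - sensitivity) / specificity   # how much a negative result lowers the odds
    dor = lr_pos / lr_neg                      # diagnostic odds ratio
    return {"lr_pos": lr_pos, "lr_neg": lr_neg, "dor": dor}

# Rounded pooled (sensitivity, specificity) pairs as reported in the abstract
pooled_estimates = {
    "all binary models": (0.87, 0.88),
    "CT-based": (0.86, 0.87),
    "respiratory sound": (0.91, 0.96),
}

for label, (sens, spec) in pooled_estimates.items():
    s = diagnostic_summary(sens, spec)
    print(f"{label}: LR+ = {s['lr_pos']:.2f}, LR- = {s['lr_neg']:.2f}, DOR = {s['dor']:.0f}")
```

The recomputed DORs (roughly 49, 41, and 243) fall close to the reported values of 52, 42, and 237 and well inside their confidence intervals. The small gaps are plausibly due to rounding of the inputs and to the reported DORs being estimated within the fitted model.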

Liu et al. studied this question.

synapsesocial.com/papers/69730fe2c8125b09b0d1fa90 https://doi.org/10.2196/83459
