Abstract

Rationale
Deep learning (DL)-based quantitative scores are increasingly applied to assess interstitial lung disease (ILD) on volumetric high-resolution CT (HRCT) scans, and they have shown strong potential for efficient and reproducible staging and screening. Before DL approaches emerged, radiomic machine learning (ML) models were widely used to extract imaging features for quantifying fibrotic and interstitial abnormalities. This study aimed to compare longitudinal changes in ILD burden derived from DL and ML models.

Methods
The Imaging Signature IPF (IS-IPF) study is a retrospective cohort designed to characterize CT-based imaging biomarkers in 500 subjects with ILD, with data collected between March 2004 and October 2019. The baseline mean (±SD) age was 64.6 ± 8.3 years; mean percent predicted FVC and DLCO were 72.9% ± 16.8% and 59.3% ± 18.8%, respectively. Among these, 180 subjects had follow-up HRCT scans (mean ± SD: 12.2 ± 5.7 months). Quantitative lung fibrosis (QLF) and quantitative ILD (QILD) scores were computed using (1) a residual convolutional neural network-based DL model and (2) a radiomic ML model. We compared QLF and QILD scores between models at baseline using equivalence t-tests and evaluated changes from baseline using two-sample t-tests between progression groups defined by the single time point prediction (STP) score.

Results
Among 180 patients with paired HRCTs, DL scores were available for 61 volumetric scans. Changes in ML and DL QLF were equivalent within a 2% margin (p = 0.0004), with mean (SE) changes of 0.92% (0.86) and 0.61% (0.73), respectively. Similarly, changes in QILD were equivalent at the 2% limit (p = 0.0066), with mean (SE) changes of 0.27 (1.30) for ML and 0.19 (1.05) for DL scores. Of the 61 subjects, 12 had STP ≥30% and 49 had STP <30%. Comparing the STP ≥30% group with the STP <30% group, the mean (SE) changes in QLF were 5.6% (1.7) vs −0.2% (0.9) for ML and 5.6% (2.5) vs −0.6% (0.6) for DL. For QILD, the mean (SE) changes were 4.1% (2.5) vs −0.7% (1.5) for ML and 4.9% (3.1) vs −1.0% (1.0) for DL. Baseline STP was significantly associated with longitudinal changes in all quantitative CT measures except ML QILD (p = 0.0005, p = 0.0067, p = 0.0265, and p = 0.1411 for DL QLF, ML QLF, DL QILD, and ML QILD, respectively; Figure 1).

Conclusions
Deep learning-based quantitative scores derived from volumetric HRCT scans perform comparably to radiomic ML models in assessing longitudinal changes in ILD. These findings support the reliability of DL-based quantitative imaging biomarkers for disease monitoring and clinical research applications.

This abstract is funded by: Boehringer Ingelheim
Kim et al. studied this question.
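The equivalence comparison reported in the Results (ML vs DL changes within a 2% margin) can be illustrated with a two one-sided tests (TOST) procedure on paired differences. The sketch below is only an illustration under stated assumptions: the abstract does not specify the exact test variant, the function name `tost_paired` and the per-subject differences are hypothetical, and a normal approximation stands in for the t distribution to keep the example dependency-free. It does not reproduce the reported p-values.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def tost_paired(diffs, margin):
    """Two one-sided tests (TOST) for equivalence of paired differences.

    Assumption: a normal approximation replaces the t distribution.
    Returns (mean difference, standard error, TOST p-value).
    """
    n = len(diffs)
    d_bar = mean(diffs)
    se = stdev(diffs) / sqrt(n)
    z = NormalDist()
    # H0a: true mean difference <= -margin (rejected when the statistic is large)
    p_lower = 1.0 - z.cdf((d_bar + margin) / se)
    # H0b: true mean difference >= +margin (rejected when the statistic is very negative)
    p_upper = z.cdf((d_bar - margin) / se)
    # Equivalence is claimed only if both one-sided nulls are rejected,
    # so the overall p-value is the larger of the two.
    return d_bar, se, max(p_lower, p_upper)


# Hypothetical per-subject differences (ML change minus DL change, in percentage points)
diffs = [0.5, -0.3, 0.8, 0.1, -0.6, 0.4, 0.2, -0.1, 0.7, -0.4, 0.3, 0.0]

d_bar, se, p_value = tost_paired(diffs, margin=2.0)
print(f"mean difference = {d_bar:.3f}, SE = {se:.3f}, TOST p = {p_value:.4g}")
```

A small p-value here supports equivalence within the chosen margin. With a much narrower margin, the same data would no longer support that conclusion, which is why the margin (2% in the abstract) has to be fixed in advance.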