e20000 Background: Multimodal machine-learning (ML) models that fuse imaging, pathology, omics, and clinical data may improve overall-survival (OS) prediction in non–small-cell lung cancer (NSCLC) beyond staging-based tools. We systematically reviewed the design, performance, and methodological quality of these models. Methods: Following PRISMA 2020, we searched Ovid (MEDLINE/Embase/CENTRAL/CDSR), Scopus, IEEE Xplore, and arXiv (January 2017–July 2025). Eligible studies developed or validated ML models integrating ≥2 modalities to predict OS in adults with NSCLC and reported either time-to-event or fixed-horizon binary outcomes. Two reviewers independently screened records, extracted data, and assessed risk of bias using PROBAST with PROBAST-AI items. Because of heterogeneity, results were synthesized narratively. Results: We included 18 studies (2021–2025) with per-study sample sizes ranging from 115 to 2,898. Studies framed the outcome as time-to-event OS only (n=11), fixed-horizon binary OS only (n=4), or both (n=3). The modalities used were clinical structured data (15/18), CT (12/18), molecular omics (7/18), PET (6/18), pathology whole-slide images (4/18), and EHR text (1/18). Fusion strategies clustered as early/concatenation (10/18), interaction-based (attention/bilinear/graph; 5/18), and late/score-level (3/18). For time-to-event OS, internal C-indices ranged from 0.658 to 0.893, with the highest value coming from a CT+clinical model. One study reported an external C-index (0.678, pathology+genes). An additional study reported an external time-dependent AUC of 0.845 at 1 year for a PET/CT-genomics survival model (n=32). For binary OS, internal AUROCs were 0.802–0.888 (2–5-year horizons), and internal accuracies ranged from 0.68 to 0.93 (1–5 years). External binary performance included an accuracy of 0.72 at 1 year in an immunotherapy cohort. Across studies, multimodal models typically outperformed the best single-modality comparator by approximately 0.06 in C-index or AUROC, though absolute gains varied.
Risk of bias was frequently high in the analysis domain (internal-only validation, optimistic tuning, sparse calibration reporting); code or model weights were publicly available in 5/18 studies. Conclusions: Multimodal ML models for NSCLC show consistent, modest improvements in OS discrimination versus single-modality approaches, with CT+clinical the most translationally ready pairing. However, independent validation, calibration, and transparency remain limited, constraining clinical adoption. Future work should prioritize multi-center datasets, standardized reporting, and open workflows.
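For readers less familiar with the C-index reported above, the following is a minimal sketch of how Harrell's concordance index can be computed for right-censored survival data. It is an illustration only, not the method of any included study: the function name `harrell_c_index` and the toy data are hypothetical, and pairs with tied observed times are simply skipped, which is a simplification of fuller implementations.

```python
from itertools import combinations


def harrell_c_index(times, events, risks):
    """Simplified Harrell's C-index (hypothetical helper, for illustration).

    times  : observed follow-up time per patient
    events : 1 if death was observed, 0 if the patient was censored
    risks  : model risk score per patient (higher = worse predicted survival)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let `a` be the patient with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the shorter time is an observed death.
        # Assumption: pairs with tied times are skipped in this sketch.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0   # higher risk died first: concordant
        elif risks[a] == risks[b]:
            concordant += 0.5   # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: four patients, the third is censored at time 15.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.6, 0.1]
print(harrell_c_index(times, events, risks))
```

In the toy data, five pairs are comparable and four are ordered correctly by the risk score. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the multimodal gain of approximately 0.06 should be read.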
Jacome et al. studied this question.