Multimodal artificial intelligence (AI) aims to integrate imaging, clinical, molecular, and textual data to improve cancer diagnosis, early detection, and prognostication beyond what single modalities can achieve. Recent work across breast, lung, cervical, pancreatic, oral, and brain tumors demonstrates consistent performance gains when imaging is fused with structured clinical data, omics, or clinical text, particularly for diagnostic and short-horizon prediction tasks. Multimodal radiogenomic frameworks for glioma and other malignancies have advanced sophisticated fusion strategies (such as attention-gated tensor fusion, co-attention transformers, and interpretable multi-omics integration), primarily in the context of survival and recurrence prediction. Only a small subset of studies rigorously frames pre-diagnostic incident cancer prediction with temporally constrained inputs, most notably in early-stage pancreatic cancer using large-scale electronic health records (EHR) that include clinical notes. Across modalities, intermediate or late fusion of modality-specific encoders usually outperforms both unimodal models and naïve concatenation, with absolute AUC or C-index gains on the order of 0.05–0.10 in many reports. Interpretability remains an active area: most studies provide modality-specific explanations (e.g., Grad-CAM for imaging, feature attributions for clinical and omic variables), and only a few frameworks embed biological or cross-modal interpretability by design. The current literature offers no convincing examples of integrating wearable or other sensor data jointly with EHR data for early cancer detection, a clear gap for future work. This narrative review synthesizes trends in modality combinations, fusion architectures, temporal framing, evaluation, and interpretability, and outlines opportunities for truly longitudinal, pre-diagnostic multimodal cancer risk modeling.
Ghiasi et al. (Thu,) studied this question.