What question did this study set out to answer?

This review aims to explore advancements in multimodal artificial intelligence to enhance cancer diagnostics and predictive accuracy.

May 11, 2026 · Open Access

Multimodal Artificial Intelligence for Predictive and Early Cancer Diagnostics: A Narrative Review

Key Points

Narrative review of recent work on integrating imaging, clinical, and molecular data in cancer diagnostics.
Analysis of various fusion strategies and frameworks for cancer prediction, focusing on interpretability and performance.
Examination of studies utilizing electronic health records for early-stage cancer risk modeling.
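Fusing imaging with structured clinical data, omics, or clinical text consistently improves performance across breast, lung, cervical, pancreatic, oral, and brain tumors, particularly for diagnostic and short-horizon prediction tasks.
Intermediate or late fusion of modality-specific encoders usually outperforms unimodal models and naive concatenation, with absolute AUC or C-index gains of roughly 0.05–0.10 in many reports.
No convincing examples were found of studies integrating wearable or sensor data jointly with EHR for early cancer detection, indicating a clear gap for future work.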

Abstract

Multimodal artificial intelligence (AI) aims to integrate imaging, clinical, molecular, and textual data to improve cancer diagnosis, early detection, and prognostication beyond what is achievable with single modalities alone. Recent work across breast, lung, cervical, pancreatic, oral, and brain tumors demonstrates consistent performance gains when fusing imaging with structured clinical data, omics, or clinical text, particularly for diagnostic and short‑horizon prediction tasks. Multimodal radiogenomic frameworks for glioma and other malignancies have advanced sophisticated fusion strategies—such as attention-gated tensor fusion, co‑attention transformers, and interpretable multi‑omics integration—primarily in the context of survival and recurrence prediction. Only a small subset of studies rigorously frame pre‑diagnostic incident cancer prediction with temporally constrained inputs, most notably in early‑stage pancreatic cancer using large‑scale electronic health records (EHR) including clinical notes. Across modalities, intermediate or late fusion of modality‑specific encoders usually outperforms unimodal models and naïve concatenation, with absolute AUC or C‑index gains on the order of 0.05–0.10 in many reports. Interpretability remains an active area, with most studies providing modality‑specific explanations (e.g., Grad‑CAM for imaging, feature attributions for clinical and omic variables), and only a few frameworks embedding biological or cross‑modal interpretability by design. Within the current literature, no convincing examples integrate wearables or sensor data jointly with EHR for early cancer detection, highlighting a clear gap for future work. This narrative review synthesizes trends in modality combinations, fusion architectures, temporal framing, evaluation, and interpretability, and outlines opportunities for truly longitudinal, pre‑diagnostic multimodal cancer risk modeling.
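To make the late-fusion idea concrete, the following minimal sketch averages the predicted probabilities of two modality-specific models and compares AUC before and after fusion. It is an illustration only: the function names (`late_fusion`, `auc`), the equal default weights, and the six toy patients are assumptions made for this example and are not taken from any study covered by the review.

```python
from typing import Dict, List, Optional, Sequence


def late_fusion(
    probs_by_modality: Dict[str, Sequence[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Late fusion: weighted average of per-modality predicted probabilities."""
    names = list(probs_by_modality)
    if weights is None:
        # Assumption: equal weight per modality when none are given.
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    n_samples = len(probs_by_modality[names[0]])
    return [
        sum(weights[name] * probs_by_modality[name][i] for name in names) / total
        for i in range(n_samples)
    ]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg
    )
    return wins / (len(pos) * len(neg))


# Synthetic toy data (hypothetical, for illustration only).
labels = [1, 1, 1, 0, 0, 0]
imaging_probs = [0.9, 0.4, 0.7, 0.5, 0.2, 0.3]   # output of an imaging model
clinical_probs = [0.6, 0.8, 0.3, 0.2, 0.4, 0.5]  # output of a clinical-data model

fused_probs = late_fusion({"imaging": imaging_probs, "clinical": clinical_probs})

print("imaging AUC :", round(auc(labels, imaging_probs), 3))
print("clinical AUC:", round(auc(labels, clinical_probs), 3))
print("fused AUC   :", round(auc(labels, fused_probs), 3))
```

In this toy case each modality misranks a different patient, so averaging corrects both errors. Real gains depend on how complementary the modalities are, and intermediate fusion (combining learned embeddings rather than output probabilities) would need a trained model rather than a simple average.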
