A transformer-based multimodal temporal foundation model achieved a mean AUROC of 0.77 across 246 downstream prediction tasks, outperforming age-sex and clinical text baselines.
Cohort (n=7,200,000)
Does a transformer-based multimodal temporal foundation model improve prediction of disease onset, progression, and treatment response compared to standard baselines in a large healthcare cohort?
7.2 million patients from a retrospective cohort in a major U.S. healthcare system, with data spanning 33 years and 25 billion medical events across 28 clinical modalities.
Transformer-based multimodal temporal foundation model
Age-sex baseline, clinical text baseline, and task-specific supervised baselines
Mean AUROC across 246 downstream prediction tasks (including new onset of 87 diseases, progression of 56 diseases, treatment response for 100 therapy-outcome pairs, and three short-term operational tasks)
A multimodal temporal foundation model integrating 28 clinical modalities across 7.2 million patients achieved a mean AUROC of 0.77 across 246 prediction tasks, outperforming standard baselines and enabling automated cohort discovery.
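The headline metric, a mean AUROC over many tasks, can be illustrated with a minimal sketch. The abstract does not say how per-task AUROCs were aggregated, so the unweighted (macro) mean below is an assumption, and the task names, labels, and scores are made-up toy values rather than study data. Only the task counts (87, 56, 100, and 3) come from the source.

```python
from statistics import mean

# Task counts as reported in the abstract; they should total 246.
task_counts = {
    "new_onset": 87,
    "progression": 56,
    "treatment_response": 100,
    "operational": 3,
}

def auroc(labels, scores):
    """AUROC as the probability that a random positive example is scored
    above a random negative example (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_auroc(tasks):
    """Unweighted mean of per-task AUROCs, so each task counts equally
    (an assumption; the source does not specify the aggregation)."""
    return mean(auroc(labels, scores) for labels, scores in tasks.values())

# Hypothetical toy tasks: (binary labels, predicted scores).
toy_tasks = {
    "onset_task": ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    "progression_task": ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.6]),
}

per_task = {name: auroc(y, s) for name, (y, s) in toy_tasks.items()}
overall = mean_auroc(toy_tasks)
print(per_task, overall)
```

Under a macro mean, a rare operational task weighs as much as a common disease-onset task, which is one reason the aggregation choice matters when reading a single summary number such as 0.77.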
Abstract
Healthcare data are fragmented across time and modalities, including clinical reports, imaging, and lab tests. While Electronic Health Records (EHRs) capture rich longitudinal health trajectories, current predictive modeling approaches typically model individual modalities in isolation, missing the context needed to understand complex diseases such as cancers. To bridge this gap, we aim to synthesize the entirety of a patient's medical history into a unified computable representation.
We curated a retrospective cohort from a major U.S. healthcare system, comprising 25 billion medical events from 7.2 million patients spanning 33 years. This dataset integrates 28 distinct clinical modalities, including structured data (diagnoses, medications, vital signs, flowsheet, and laboratory results), clinical notes, and imaging data. We developed a transformer-based multimodal temporal foundation model that tokenizes each modality with modality-specific encoders and fuses events over time into a unified patient embedding. We evaluated frozen patient embeddings on 246 downstream prediction tasks, including new onset of 87 diseases, progression of 56 diseases, treatment response for 100 therapy-outcome pairs, and three short-term operational tasks.
Across all tasks, the model achieved a mean AUROC of 0.77, outperforming age-sex, clinical text, and task-specific supervised baselines. On oncology-focused tasks spanning solid and hematologic malignancies and systemic therapies, the model outperformed the age-sex baseline by 9% for new neoplasm onset, 18% for neoplasm progression, and 16% for treatment response. Unsupervised clustering of patient embeddings recovered clinically coherent groupings of cancer types, comorbidities, and treatment patterns, forming a multiscale, data-driven atlas of medical phenotypes.
The same embeddings enabled similarity search to identify patients with comparable trajectories, supporting automated cohort discovery and fine-grained clinical trial matching. Gradient-based interpretability analyses identified multimodal risk factors for disease onset and treatment response that aligned with clinical expectations, providing transparent attribution at both patient and population level.
A single multimodal, temporally aware EHR foundation model can learn general-purpose whole-patient representations that support accurate early prediction and phenotyping of cancer outcomes while remaining applicable across diverse diseases. By consolidating fragmented data into a continuously updated patient representation, this approach lays the groundwork for shifting oncology from reactive, episodic care to proactive, continuous risk management, and provides a scalable basis for risk stratification, trial optimization, and discovery of clinically interpretable multimodal biomarkers.
Citation Format: Andrew Zhang, Tong Ding, Sophia J. Wagner, Caiwei Tian, Ming Yang Lu, Alexandre Misrahi, Joshua E. Lewis, Rowland Pettit, Long P. Le, Faisal Mahmood. A healthcare system scale multimodal whole patient temporal foundation model [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2026; Part 1 (Regular Abstracts); 2026 Apr 17-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2026;86(7 Suppl):Abstract nr 33.
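The similarity search over patient embeddings mentioned in the abstract can be sketched as a nearest-neighbor lookup. The abstract does not state which similarity measure or index was used, so cosine similarity with a brute-force scan is an assumption here, and the patient identifiers and three-dimensional vectors are hypothetical toy values, not real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def most_similar(query_id, embeddings, k=2):
    """Return the ids of the k patients whose embeddings are closest to the
    query patient's embedding, most similar first (brute-force scan)."""
    query = embeddings[query_id]
    scored = [
        (cosine_similarity(query, vec), pid)
        for pid, vec in embeddings.items()
        if pid != query_id  # never return the query patient itself
    ]
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:k]]

# Hypothetical toy embeddings; a real model would produce far longer vectors.
toy_embeddings = {
    "patient_a": [1.0, 0.0, 0.2],
    "patient_b": [0.9, 0.1, 0.3],
    "patient_c": [0.0, 1.0, 0.0],
    "patient_d": [-1.0, 0.0, 0.1],
}

neighbors = most_similar("patient_a", toy_embeddings, k=2)
print(neighbors)
```

A brute-force scan is only practical for small cohorts. At the scale of millions of patients, an approximate nearest-neighbor index would likely be needed, though the source gives no detail on this.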
J. Andrew Zhang
Moorpark College
Tong Ding
Sophia J. Wagner
Brigham and Women's Hospital
Cancer Research
Brigham and Women's Hospital
Massachusetts General Hospital
Zhang et al. (2026) conducted a retrospective cohort study covering various diseases, including cancers (n=7,200,000). A transformer-based multimodal temporal foundation model was compared with age-sex, clinical text, and task-specific supervised baselines on 246 downstream prediction tasks, with performance measured by mean AUROC. The model achieved a mean AUROC of 0.77 across these tasks, outperforming all three types of baseline.
synapsesocial.com/papers/69d1fe68a79560c99a0a4b4b — DOI: https://doi.org/10.1158/1538-7445.am2026-33