Abstract

Recent genomic foundation models have advanced DNA sequence interpretation, yet most remain constrained to local sequence patterns and fail to produce the patient-level insights required for clinical decision-making. To address this limitation, we developed a framework that extends beyond sequence-level inference, enabling robust patient stratification through a Cancer Foundation Model. Our approach begins with “DNAChunker”, which employs a dynamic H-net-based tokenization strategy that divides the genome into variable-length segments, preserving high-resolution detail in regulatory and coding regions while efficiently compressing repetitive sequences. When evaluated on the Nucleotide Transformer and Genomic Benchmarks, DNAChunker achieved performance comparable to the state-of-the-art GENERator (1.2 billion parameters) while using only 156 million parameters. To translate these genomic embeddings into patient-level insights, we implemented a transformer-based Cancer Aggregation Model that integrates mutation embeddings with somatic copy-number alteration (SCNA) features. The framework was evaluated on large whole-genome sequencing (WGS) cohorts, including PCAWG (n=2,040) and CUBRICS breast cancer samples (n=1,053), with TCGA-BRCA (breast cancer; n=920) serving as an external validation cohort. The model effectively stratified patients by cancer type (accuracy, 96.89%), homologous recombination deficiency (HRD; accuracy, 92.83%), and PAM50 subtype (accuracy, 84.05%). Notably, it classified PAM50 intrinsic subtypes using only DNA-level information, eliminating the conventional reliance on RNA-based expression profiling. The Cancer Foundation Model demonstrates that patient-level representation learning from whole-genome data can achieve clinically meaningful stratification across diverse tumor types.
By bridging the gap between genomic sequence interpretation and actionable phenotypic classification, this framework establishes a foundation for AI-based precision oncology. With further validation, it will facilitate biomarker discovery and patient stratification in clinical trials directly from WGS data.

Citation Format: Jonghoon Lee, Chunyang Bao, Hansol Park, Gang-Hee Lee, Yoonsuh Lee, Beomki Lee, David Lehotzky, Ron Solan, Antonia Kowalewski, Xavi Loinaz, Vasuki Narasimha Swamy, David I. Heiman, Samantha Van Seters, Saveliy Belkin, Sam Wiseman, Andrew D. Cherniack, Luis Antonio Corchete Sanchez, Brian P Danysh, Zachary Everton, Chip Stewart, Haruna Tomono, Gengchao Wang, Esther Rheinbay, Gad Getz, Young Seok Ju, Won-Chul Lee, Ryul Kim. AI-Driven stratification of cancer patients using The Cancer Genome Atlas whole-genome sequencing data abstract. In: Proceedings of the American Association for Cancer Research Annual Meeting 2026; Part 1 (Regular Abstracts); 2026 Apr 17-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2026;86(7 Suppl):Abstract nr 7268.
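The abstract describes combining per-mutation embeddings with SCNA features into a single patient-level representation. The following is only a minimal sketch of that idea, not the authors' Cancer Aggregation Model: it swaps the transformer for a single attention-pooling step with a fixed query vector, and every name in it (`aggregate_patient`, `query`, the toy inputs) is a hypothetical chosen for illustration.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_patient(
    mutation_embeddings: Sequence[Sequence[float]],
    scna_features: Sequence[float],
    query: Sequence[float],
) -> List[float]:
    """Pool per-mutation embeddings into one vector, then append SCNA features.

    Hypothetical stand-in for a transformer-based aggregator: one
    attention-pooling step against a fixed (here, hand-supplied) query.
    """
    if not mutation_embeddings:
        raise ValueError("need at least one mutation embedding")
    dim = len(query)
    # Scaled dot-product score of each mutation embedding against the query.
    scores = [
        sum(q * x for q, x in zip(query, emb)) / math.sqrt(dim)
        for emb in mutation_embeddings
    ]
    weights = softmax(scores)
    # Attention-weighted sum of the embeddings, one value per dimension.
    pooled = [
        sum(w * emb[i] for w, emb in zip(weights, mutation_embeddings))
        for i in range(dim)
    ]
    # Concatenate the pooled mutation vector with the SCNA feature vector.
    return pooled + list(scna_features)


# Toy patient: three mutations with 2-d embeddings and two SCNA features.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scna = [0.5, -0.25]
patient_vector = aggregate_patient(embeddings, scna, query=[0.0, 0.0])
print(patient_vector)
```

With an all-zero query the attention weights are uniform, so the pooled part reduces to a plain mean of the mutation embeddings. In a trained model the query and the embeddings would be learned, and the resulting patient vector would feed a classifier for tasks such as cancer type, HRD status, or PAM50 subtype.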
Jonghoon Lee
Chunyang Bao
Hansol Park
Cancer Research
University of California, San Diego
Broad Institute
Korea Advanced Institute of Science and Technology
www.synapsesocial.com/papers/69d1fe07a79560c99a0a47b1 — DOI: https://doi.org/10.1158/1538-7445.am2026-7268