Abstract

Introduction: Blood-based liquid biopsies offer potential for non-invasive cancer screening. However, detecting early-stage disease is complicated by low levels of circulating tumor biomarkers and background noise from normal cells. To address this, we developed a novel deep-learning framework to detect cancer signal at the resolution of single DNA reads. Applied to bisulfite-converted cell-free DNA (cfDNA) samples, our method significantly improves early-stage cancer sensitivity.

Methods: We designed a massively parallel 2-D convolutional neural network architecture that differentiates cancer and non-cancer signal in cfDNA by learning local methylation patterns at thousands of genomic regions. The model takes next-generation sequencing (NGS) data as input, encodes aligned sequences within genomic windows as images, and outputs informative feature vectors for classification. However, training is complicated by two real-world data limitations: (1) disease samples contain a mix of unlabeled fragments from normal and diseased cells, and (2) acquiring sufficient early-stage disease data is costly, burdensome, and time-intensive. To address these, we designed a novel data generation technique that (1) assigns positive labels for groups of reads via in silico spike-in of tumor biopsy reads and (2) generates large, diverse datasets via fine-tuned in silico mixing of non-cancer cfDNA reads. Our model’s architecture has three key advantages: compact input encoding, interpretable saliency maps, and a scalable parallel design. We trained on 720 TB of data across 10 million genomic bases in a single day, highlighting the framework’s efficiency.

Results: We validated our method with targeted bisulfite sequencing data from the CORE-HH clinical study (NCT05435066, N=1229 non-cancers, N=1118 cancers, including N=599 Stage I/II).
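The in silico spike-in and window encoding described in the Methods can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `spike_in` and `encode_window`, the per-site encoding (1 = methylated, 0 = unmethylated, -1 = not covered), and the per-read labels are all assumptions made for illustration. The abstract states only that positive labels are assigned to groups of reads and that aligned reads in a window are encoded as images.

```python
import numpy as np

def spike_in(normal_reads, tumor_reads, tumor_fraction, n_reads, rng):
    """Mix tumor biopsy reads into non-cancer cfDNA reads (hypothetical helper).

    Each read is a row over the CpG sites of one genomic window, using the
    assumed encoding 1 = methylated, 0 = unmethylated, -1 = not covered.
    Returns (reads, labels); labels[i] is 1 for a spiked-in tumor read.
    """
    n_tumor = int(round(tumor_fraction * n_reads))
    n_normal = n_reads - n_tumor
    # Sample with replacement so small source pools can fill large mixtures.
    tumor_idx = rng.choice(len(tumor_reads), size=n_tumor, replace=True)
    normal_idx = rng.choice(len(normal_reads), size=n_normal, replace=True)
    reads = np.concatenate([tumor_reads[tumor_idx], normal_reads[normal_idx]])
    labels = np.concatenate([
        np.ones(n_tumor, dtype=np.int8),
        np.zeros(n_normal, dtype=np.int8),
    ])
    # Shuffle so read order carries no label information.
    order = rng.permutation(n_reads)
    return reads[order], labels[order]

def encode_window(reads, max_reads):
    """Pad or truncate a stack of reads into a (max_reads, n_sites) image."""
    n_sites = reads.shape[1]
    image = np.full((max_reads, n_sites), -1, dtype=np.int8)
    k = min(len(reads), max_reads)
    image[:k] = reads[:k]
    return image

# Toy data: random non-cancer reads, fully methylated "tumor" reads.
rng = np.random.default_rng(0)
normal = rng.integers(0, 2, size=(200, 8)).astype(np.int8)
tumor = np.ones((50, 8), dtype=np.int8)

reads, labels = spike_in(normal, tumor, tumor_fraction=0.1, n_reads=40, rng=rng)
image = encode_window(reads, max_reads=64)
group_label = int(labels.any())  # one possible group-level label
```

Varying `tumor_fraction` and resampling in this way is one route to many labeled mixtures from a fixed pool of source reads. How the actual framework tunes the mixing is not specified in the abstract.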
We pretrained the model on 1.3 billion training examples generated using a held-out set of non-cancer plasma (N=174) and tumor tissue biopsies (N=505). Predictions on data from clinical samples yielded feature vectors, and saliency maps confirmed that the model highlights biopsy-learned patterns. In a 10x5 cross-validation, classifiers trained on these feature vectors improved overall sensitivity by 9.9 points (Stage I: +6.5 pts, II: +17.6 pts, III: +14.3 pts, IV: +9.5 pts) at 98.5% specificity, compared to classifiers trained on region-wide average methylation values. These performance improvements, coupled with the scalability of the framework, underscore its potential as a transformative tool in the early diagnosis of cancer and establish a foundation for training models on NGS data in other liquid biopsy assays. Future work will investigate the potential to incorporate per-read embeddings from DNA-based large language models without sacrificing scalability.

Citation Format: Jackson A. Killian, Kade Pettie, Kyle Gowen, Shiva Farashahi, Esther Brown, Feras Hantash, Jocelyn Charlton, Franziska Michor, Kieran I. Chacko, Dorna Kashef. Improving early cancer detection by training scalable deep neural networks to extract tumor signal from cell-free DNA [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2026; Part 1 (Regular Abstracts); 2026 Apr 17-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2026;86(7 Suppl):Abstract nr 5465.
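The Results report sensitivity at a fixed 98.5% specificity. As a rough sketch of how such a figure can be computed from classifier scores, the hypothetical helper below picks the score threshold from the non-cancer samples and then measures the fraction of cancer samples above it. The function name, the toy scores, and the "strictly above threshold" calling rule are assumptions; the abstract does not describe the authors' exact thresholding procedure.

```python
import numpy as np

def sensitivity_at_specificity(scores, labels, specificity=0.985):
    """Sensitivity at a threshold achieving at least the target specificity.

    scores: higher means more cancer-like; labels: 1 = cancer, 0 = non-cancer.
    Returns (sensitivity, threshold).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    neg = np.sort(scores[~labels])
    # At least k non-cancer scores must fall at or below the threshold.
    k = int(np.ceil(specificity * len(neg)))
    threshold = neg[k - 1]
    # A sample is called positive only if its score exceeds the threshold.
    sensitivity = float(np.mean(scores[labels] > threshold))
    return sensitivity, float(threshold)

# Toy example: 100 non-cancer scores (0..99) and 5 cancer scores.
neg_scores = np.arange(100)
pos_scores = np.array([50, 95, 99, 120, 150])
all_scores = np.concatenate([neg_scores, pos_scores])
all_labels = np.concatenate([np.zeros(100, dtype=int), np.ones(5, dtype=int)])

sens, thr = sensitivity_at_specificity(all_scores, all_labels, specificity=0.9)
achieved_spec = float(np.mean(neg_scores <= thr))
```

Running the same function on scores from two classifiers and subtracting the sensitivities, multiplied by 100, gives a difference in percentage points comparable in form to the per-stage improvements quoted above.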