A Random Forest machine learning classifier using whole exome sequencing genomic features predicted four major breast cancer receptor subtypes with 80.3% overall agreement.
Can a machine learning classifier using genomic features accurately predict breast cancer receptor subtypes without relying on immunohistochemistry or gene expression data?
A machine learning classifier using genomic features from whole exome sequencing can accurately predict breast cancer receptor subtypes, offering a potential alternative when tissue samples are inadequate for standard immunohistochemistry.
Abstract
Risk stratification, treatment course, and prognosis for patients with breast cancer presently rely upon the accurate determination of receptor subtype, ascertained through immunohistochemistry (IHC) for estrogen receptor (ER) and progesterone receptor (PR), and evaluation of HER2 expression (IHC and/or gene amplification via in situ hybridization). While IHC-based subtyping assays are informative, they require high-quality tissue samples, and the assays are susceptible to fixation artifacts, variability in antibody staining performance, and semi-quantitative, subjective result calling. In cases of diminished sample quality, IHC-based subtype assessment may not agree with gene expression-based classification, and alternative approaches may be needed. This study aimed to develop a machine learning classifier able to predict breast cancer receptor subtypes using genomic features, without relying on immunohistochemistry or gene expression data. This study included 19,559 patients with primary breast cancer, identified using Natera’s proprietary real-world database, linked to a clinical claims database. Hormone receptor (HR) and HER2 subtype was determined from patient treatment codes. We developed a biologically informed feature set by combining somatic mutations across 19,820 genes, using whole exome sequencing (WES) data from the Signatera™ testing workflow. Each mutation was assigned a composite mutation_score (range 1-12) based on variant class (SNV, insertion, deletion), superclass (SNP/INDEL), predicted impact (VEP annotation impact: MODIFIER to HIGH), and functional consequence (such as frameshift, stop-gain, missense, synonymous). A Random Forest classifier was trained with a stratified 75/25 train-test split and hyperparameter optimization. The model was trained on features from 14,669 patients in the training cohort. In a test cohort of 4,890 patients, the model achieved 80.3% overall agreement with HR/HER2 status as inferred through medication claims data, with balanced performance across four major subtypes. Per-subtype metrics were: for HR+/HER2-, precision was 0.935, recall was 0.911, and F1 score was 0.923; for HR-/HER2+, precision was 0.714, recall was 0.753, and F1 score was 0.783; for HR+/HER2+, precision was 0.748, recall was 0.734, and F1 score was 0.741; lastly, for the TNBC subtype, precision was 0.730, recall was 0.816, and F1 score was 0.770. Overall, the genomic classifier accurately classifies breast cancer into one of the four major receptor subtypes. After definitive validation against clinically reported HR/HER2 status, this classifier could be used to guide analyses of de-identified genomic datasets that lack complete clinical annotation.
Citation Format: Sandro Satta, Philip Miller, Samuel Rivero-Hinojosa, Ekaterina Kalashnikova, Angel Rodriguez, Minetta C. Liu. A machine learning approach to classify breast cancer receptor subtype using genomic features [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2026; Part 1 (Regular Abstracts); 2026 Apr 17-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2026;86(7 Suppl):Abstract nr 2724.
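The per-subtype F1 scores in the abstract can be related back to the reported precision and recall, since F1 is the harmonic mean of the two. The following is a minimal sketch, not the authors' code: the dictionary and function names are illustrative assumptions, and the inputs are the rounded values from the abstract, so small differences in the third decimal place are expected. Recomputed this way, three subtypes agree with the reported F1 to within rounding, while HR-/HER2+ comes out near 0.733 rather than the reported 0.783.

```python
# Values copied from the abstract: (precision, recall, reported F1).
# The names below are illustrative, not taken from the study.
REPORTED = {
    "HR+/HER2-": (0.935, 0.911, 0.923),
    "HR-/HER2+": (0.714, 0.753, 0.783),
    "HR+/HER2+": (0.748, 0.734, 0.741),
    "TNBC": (0.730, 0.816, 0.770),
}


def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def recompute_f1(reported=REPORTED):
    """Map each subtype to (recomputed F1, reported F1)."""
    return {
        subtype: (f1_score(p, r), f1)
        for subtype, (p, r, f1) in reported.items()
    }


if __name__ == "__main__":
    for subtype, (calc, rep) in recompute_f1().items():
        print(f"{subtype}: recomputed {calc:.3f}, reported {rep:.3f}")
```

Because the inputs are rounded to three decimals, this check can only flag gaps larger than rounding error; it says nothing about which of the three reported numbers for a subtype is the source of a gap.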