March 22, 2026 · Open Access

A single formula predicts where cancer mutations and epigenetic aging converge from DNA sequence alone

Key Points

The aim is to develop a predictive model that identifies regions of cancer genomes with high mutation rates based solely on DNA sequences.
The study introduces a formula, termed Ev (a 4-mer frequency skewness statistic), that scores 5-kb genomic windows across the human genome.
Ev was applied to 582,028 non-overlapping 5-kb windows.
Mutation rates were compared statistically across TCGA cancer types and healthy germline genomes.
Zone 3 (heterochromatin) carries 1.68-fold more somatic mutations than Zone 1 (euchromatin) in 992 TCGA breast cancer genomes.
The mutation distribution is consistent across all 15 TCGA cancer types tested.
Zone 3 shows a 1.94-fold enrichment for the CpG sites of Horvath's multi-tissue epigenetic aging clock, specifically those that gain methylation with age.
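
The windowing and scoring steps listed above can be illustrated with a minimal sketch. The exact definition of Ev is not given here, so the code below assumes, purely for illustration, that the score is the population skewness of the 256 4-mer counts in each window; the function names (windows, kmer_counts, skewness, window_scores) are hypothetical and are not taken from the study.

```python
from collections import Counter
from itertools import product

WINDOW = 5_000          # 5-kb windows, as described in the text
K = 4                   # 4-mers, as described in the text
VALID = set("ACGT")

def windows(seq, size=WINDOW):
    """Yield (start, subsequence) for non-overlapping, full-length windows."""
    for start in range(0, len(seq) - size + 1, size):
        yield start, seq[start:start + size]

def kmer_counts(seq, k=K):
    """Return counts for all 4**k k-mers in a fixed order.

    k-mers containing symbols other than A/C/G/T (for example N) are skipped.
    """
    counts = Counter()
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= VALID:
            counts[kmer] += 1
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]

def skewness(values):
    """Population skewness m3 / m2**1.5; returns 0.0 when the variance is zero."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0

def window_scores(seq):
    """ASSUMPTION: stand-in score, not the published Ev formula."""
    return [(start, skewness(kmer_counts(w))) for start, w in windows(seq)]
```

Calling window_scores on a chromosome sequence string would return one (start, score) pair per full 5-kb window; how scores are then cut into Zones 1 to 3 is not specified in this summary and is left out of the sketch.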

Abstract

Regional variation in somatic mutation rates across cancer genomes is strongly associated with chromatin organization, but existing predictive models require experimental epigenomic data such as ChIP-seq or Hi-C. Here I show that a single formula operating on raw DNA sequence, a 4-mer frequency skewness statistic termed Ev, classifies genomic windows into chromatin-like zones without any experimental input, training data, or reference databases. Applying Ev to 582,028 non-overlapping 5-kb windows across the human genome, I find that the lowest-scoring zone (Zone 3, corresponding to heterochromatin) carries 1.68-fold more somatic mutations than the highest-scoring zone (Zone 1, euchromatin) in 992 TCGA breast cancer genomes (odds ratio = 1.682, P < […] C>T transitions consistent with SBS1 (5-methylcytosine deamination; OR = 1.24, P = 8.7 × 10⁻²⁰), and is independently confirmed by H3K4me3 ChIP-seq enrichment analysis (3.45-fold, P = 7.2 × 10⁻²²³). Zone 3 vulnerability is constitutional: it is present in healthy germline genomes from 2,504 individuals (Z3/Z1 = 1.13, P = 1.5 × 10⁻⁸²) and amplified 1.38-fold in cancer. Zone 3 is enriched 1.94-fold (P = 2.0 × 10⁻⁵) for the CpG sites composing Horvath's multi-tissue epigenetic aging clock, specifically those that gain methylation with age, linking cancer mutation geography and epigenetic aging through shared sequence-encoded chromatin vulnerability. These results demonstrate that the regional distribution of somatic mutations across cancer genomes can be predicted from DNA sequence composition alone, without recourse to experimental epigenomics. Version 8 corrects errors in earlier versions: the TSG/oncogene zone segregation OR is revised from 9.2 to 3.20 (P = 0.259, not significant), and the number of significant chromosomes is corrected from 21/23 to 20/23.
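
For readers unfamiliar with the odds ratios quoted in the abstract, the following sketch shows how an odds ratio and an approximate 95% confidence interval can be computed from a 2×2 table. The abstract does not state how the table was constructed, so the layout (mutated versus unmutated units in Zone 3 versus Zone 1) and the counts are illustrative assumptions, not data from the study; odds_ratio_ci is a hypothetical helper name.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio for the 2x2 table [[a, b], [c, d]] with a Wald interval.

    a, b: mutated / unmutated counts in the first group (e.g. Zone 3)
    c, d: mutated / unmutated counts in the second group (e.g. Zone 1)
    All four counts must be positive. The interval is computed on the
    log scale using the standard error sqrt(1/a + 1/b + 1/c + 1/d).
    """
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_) - z * se)
    upper = math.exp(math.log(or_) + z * se)
    return or_, lower, upper

# Toy counts chosen only to demonstrate the arithmetic (NOT study data).
or_, lower, upper = odds_ratio_ci(300, 700, 200, 800)
print(f"OR = {or_:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

An odds ratio above 1 with an interval that excludes 1 indicates enrichment in the first group; the P values reported in the abstract would come from a formal test on the real counts, which this sketch does not reproduce.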
