Regional variation in somatic mutation rates across cancer genomes is strongly associated with chromatin organization, but existing predictive models require experimental epigenomic data such as ChIP-seq or Hi-C. Here I show that a single formula operating on raw DNA sequence (a 4-mer frequency skewness statistic termed Ev) classifies genomic windows into chromatin-like zones without any experimental input, training data, or reference databases. Applying Ev to 582,028 non-overlapping 5-kb windows across the human genome, I find that the lowest-scoring zone (Zone 3, corresponding to heterochromatin) carries 1.68-fold more somatic mutations than the highest-scoring zone (Zone 1, euchromatin) in 992 TCGA breast cancer genomes (odds ratio = 1.682, P [...] C>T transitions consistent with SBS1 (5-methylcytosine deamination; OR = 1.24, P = 8.7 × 10⁻²⁰), and is independently confirmed by H3K4me3 ChIP-seq enrichment analysis (3.45-fold, P = 7.2 × 10⁻²²³). Zone 3 vulnerability is constitutional: it is present in healthy germline genomes from 2,504 individuals (Z3/Z1 = 1.13, P = 1.5 × 10⁻⁸²) and is amplified 1.38-fold in cancer. Zone 3 is enriched 1.94-fold (P = 2.0 × 10⁻⁵) for the CpG sites composing Horvath's multi-tissue epigenetic aging clock, specifically those that gain methylation with age, linking cancer mutation geography and epigenetic aging through shared sequence-encoded chromatin vulnerability. These results demonstrate that the regional distribution of somatic mutations across cancer genomes can be predicted from DNA sequence composition alone, without recourse to experimental epigenomics. Version 8 corrects errors in earlier versions: the TSG/oncogene zone segregation OR is revised from 9.2 to 3.20 (P = 0.259, not significant), and the number of significant chromosomes is corrected from 21/23 to 20/23.
Aditya Tiwari studied this question.
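To make the idea of a per-window 4-mer frequency skewness score concrete, the following is a minimal sketch in Python. The abstract does not give the exact definition of Ev or the thresholds that separate Zones 1 to 3, so everything here is an assumption made for illustration: the score is taken to be the standardized third moment (population skewness) of the 256 4-mer frequencies in each non-overlapping window, and the function names `kmer_frequencies`, `skewness`, and `window_scores` are hypothetical. It should not be read as the author's formula.

```python
from collections import Counter
from itertools import product

K = 4
VALID = set("ACGT")
# All 256 possible 4-mers, in a fixed order, so absent k-mers count as zero.
ALL_KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_frequencies(seq, k=K):
    """Relative frequency of each of the 4**k k-mers in seq (overlapping counts).

    k-mers containing anything other than A, C, G, T (e.g. N) are skipped.
    Returns None if the sequence contains no valid k-mer.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if VALID.issuperset(kmer):
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return None
    return [counts[m] / total for m in ALL_KMERS]


def skewness(values):
    """Population skewness m3 / m2**1.5 (assumed stand-in for Ev)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # all frequencies equal: no asymmetry
    return m3 / m2 ** 1.5


def window_scores(seq, window=5000):
    """Score each non-overlapping window; a trailing partial window is dropped."""
    scores = []
    for start in range(0, len(seq) - window + 1, window):
        freqs = kmer_frequencies(seq[start:start + window])
        scores.append(None if freqs is None else skewness(freqs))
    return scores


if __name__ == "__main__":
    # Toy 10-kb sequence: a short tandem repeat gives a strongly skewed spectrum.
    toy = "ACGT" * 2500
    print(window_scores(toy))
```

Under these assumptions, assigning windows to zones would require cut-offs on the resulting scores; those are not stated in the abstract and are deliberately left out of the sketch.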