The evolution of SARS-CoV-2 is increasingly characterized by recombinant lineages with convergent mutation constellations that drive immune escape and transmissibility. Here, we present a data-driven closed-loop framework (CLF) for genomic surveillance that integrates mutation co-occurrence, lineage dynamics, and forward forecasting across 7.4 million spike protein sequences collected globally between 2020 and 2025. Using Jaccard similarity (n = 6.8M), chi-square and mutual information (n = 158k), and Cramér’s V analysis (n = 79k), we identify non-random mutation clusters under coordinated selection, including S:D614G + S:T478K (Jaccard = 0.92) and S:N969K + S:Q954H (Cramér’s V = 0.702), indicating functional synergy in transmissibility and fusion stability. Lineage-specific analysis of Q1 2025 variants reveals the dominance of KP.2, LB.1, and FL.1.5.1, all of which carry the S:L452W and S:F456L mutations that were predicted in silico via Markov chain modeling of co-evolving networks. These variants exhibit resistance to monoclonal antibodies, and combinations such as S:L452W + S:F456L are linked to neutralization escape. Our forecasting model, trained on real-world transition probabilities, accurately anticipated the emergence of these high-fitness constellations months in advance. By integrating past observation, present validation, and future projection, we demonstrate that SARS-CoV-2 evolution follows predictable pathways shaped by epistatic interactions. This closed-loop model shifts genomic surveillance from reactive detection to proactive risk assessment, enabling earlier therapeutic updates and vaccine design. All data, code, and intermediate results are publicly available via a separate DOI.
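As a rough illustration of the two co-occurrence statistics named above, the following minimal sketch computes Jaccard similarity and Cramér’s V for a pair of mutations encoded as binary presence vectors (one entry per sequence). The function names, the placeholder labels `mut_a` and `mut_b`, and the eight-sequence toy data are assumptions made for illustration only; they are not taken from the study's data or code, and the study's actual pipeline may differ (for example, in whether a continuity correction is applied to the chi-square statistic).

```python
from math import sqrt


def jaccard(x, y):
    """Jaccard similarity between the carrier sets of two binary presence vectors."""
    if len(x) != len(y):
        raise ValueError("presence vectors must have equal length")
    carriers_x = {i for i, v in enumerate(x) if v}
    carriers_y = {i for i, v in enumerate(y) if v}
    union = carriers_x | carriers_y
    return len(carriers_x & carriers_y) / len(union) if union else 0.0


def cramers_v_2x2(x, y):
    """Cramér's V for two binary vectors, from the uncorrected 2x2 chi-square.

    For a 2x2 table, min(rows - 1, cols - 1) = 1, so V = sqrt(chi2 / n).
    """
    if len(x) != len(y):
        raise ValueError("presence vectors must have equal length")
    n = len(x)
    n11 = sum(1 for a, b in zip(x, y) if a and b)          # both present
    n10 = sum(1 for a, b in zip(x, y) if a and not b)      # only first present
    n01 = sum(1 for a, b in zip(x, y) if not a and b)      # only second present
    n00 = n - n11 - n10 - n01                              # neither present
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0  # one mutation is constant, so association is undefined
    chi2 = n * (n11 * n00 - n10 * n01) ** 2 / denom
    return sqrt(chi2 / n)


# Hypothetical toy data: presence (1) or absence (0) in eight sequences.
mut_a = [1, 1, 1, 1, 0, 0, 0, 0]
mut_b = [1, 1, 1, 0, 0, 0, 0, 1]

print(f"Jaccard    = {jaccard(mut_a, mut_b):.3f}")
print(f"Cramér's V = {cramers_v_2x2(mut_a, mut_b):.3f}")
```

Jaccard similarity ignores sequences in which neither mutation appears, whereas Cramér’s V uses the full contingency table, which is one reason the two measures can rank the same mutation pair differently.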
Tahir Bhatti
University of Modern Sciences
Tahir Bhatti studied this question.
synapsesocial.com/papers/68d44a3031b076d99fa531eb — DOI: https://doi.org/10.20944/preprints202509.0925.v1