The SARS-CoV-2 virus, responsible for the COVID-19 pandemic, has shown considerable genetic variability across different regions of the world. Understanding this geographic genetic diversity is essential for tracking viral evolution, managing outbreaks, and informing vaccine strategies. This paper presents a novel approach that integrates dimensionality reduction techniques with neural network-based clustering to analyze the genomic sequences of SARS-CoV-2. Genome samples from some of the most affected countries, including Spain, Italy, the United States, India, Brazil, and China, were obtained from public repositories. The genomic data were transformed into feature vectors using a k-mer frequency representation, and their dimensionality was then reduced with Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) to retain essential patterns in a lower-dimensional space. A deep neural clustering model was then applied to uncover hidden structure in the data, revealing clusters that correspond to geographic and genetic distinctions. Experimental results indicate that the proposed framework captures regional genetic variations of SARS-CoV-2 and provides insights into the evolution and spread of the virus across countries.
Sudhagar et al. studied this question.
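The k-mer frequency representation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the value of k, how ambiguous bases are handled, or which implementation was used, so the function name `kmer_frequencies`, the default `k=3`, the restriction to the alphabet A, C, G, T, and the toy sequences below are all assumptions made for illustration only. The resulting vectors are the kind of input that PCA, t-SNE, and a clustering model would then operate on; those later stages are not shown here.

```python
from collections import Counter
from itertools import product
from typing import Dict, List


def kmer_frequencies(sequence: str, k: int = 3) -> List[float]:
    """Return the normalized k-mer frequency vector of a nucleotide sequence.

    The vector has 4**k entries, ordered lexicographically over A, C, G, T.
    Windows containing any other symbol (for example N) are ignored.
    Hypothetical helper: the paper's actual featurization may differ.
    """
    sequence = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Enumerate all 4**k valid k-mers in a fixed order.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        # No valid k-mer found (sequence too short or fully ambiguous).
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


# Toy sequences for illustration; these are not real SARS-CoV-2 genomes.
toy_genomes: Dict[str, str] = {
    "sample_a": "ACGTACGTAC",
    "sample_b": "ACGTNNACGT",
}
features = {name: kmer_frequencies(seq, k=2) for name, seq in toy_genomes.items()}
```

Each genome, whatever its length, maps to a vector of fixed size 4^k, which is what makes the sequences comparable in a common feature space before dimensionality reduction.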