Low-coverage whole-genome sequencing (lcWGS) has emerged as a cost-effective and robust approach for population genomic studies. Despite its advantages, publicly available resources for large-scale lcWGS datasets remain limited. To our knowledge, there is not yet a bioinformatics tool capable of directly simulating lcWGS datasets from variant call format (VCF) files. To address this gap, we developed lcSimVCF, a tool that leverages multivariate Gaussian mixture models (MGMMs) to simulate lcWGS genotype likelihood distributions directly from high-coverage whole-genome sequencing (hcWGS) VCF files. As a demonstration, we used 30× hcWGS VCF files to simulate 1× genotype likelihood distributions. We trained a separate MGMM for each of three genotypes (homozygous reference, heterozygous, and homozygous non-reference), each simulating its respective lcWGS genotype likelihood distribution. Our results demonstrate that the MGMM approach captures genotype likelihood distributions more robustly than single Gaussian mixture models (SGM). Simulated lcWGS data exhibited representative patterns in population stratification analyses and showed potential for applications in polygenic risk score (PRS) modeling. Analysis of 81,271,745 imputed single nucleotide polymorphisms (SNPs) revealed strong correlations among the PRS derived from different phenotypes (PRS_CAD, PRS_T2D, and PRS_AF), with coefficients of determination (R^2) of 0.94, 0.87, and 0.85, respectively. Furthermore, we assessed simulation time across various configurations (1-16 CPUs, 1-800 individuals, 100-10,000 variants) and found that lcSimVCF simulated 10,000 variants across 800 individuals in under 30 seconds with 16 CPUs. These results show that lcSimVCF can generate large-scale lcWGS datasets.
This tool holds significant potential to facilitate the development and evaluation of bioinformatics tools and analytical pipelines that rely on access to extensive lcWGS data resources.
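The per-genotype mixture idea described above can be illustrated with a minimal sketch. This is not the lcSimVCF implementation: the mixture weights, means, and standard deviations below are invented for illustration, the names `MIXTURES`, `sample_gl`, and `simulate_sample` are hypothetical, and diagonal covariances are assumed for simplicity where a fitted MGMM would likely use full covariance matrices.

```python
import random

# Hypothetical mixture parameters for each true (high-coverage) genotype.
# Each component is (weight, mean vector, per-dimension std), and each vector
# holds log10 genotype likelihoods for (0/0, 0/1, 1/1). Values are illustrative
# only; a real model would be fitted to paired hcWGS/lcWGS data.
MIXTURES = {
    "0/0": [(0.7, (-0.05, -0.60, -3.0), (0.03, 0.20, 0.8)),
            (0.3, (-0.20, -0.30, -1.5), (0.10, 0.10, 0.5))],
    "0/1": [(0.6, (-0.80, -0.10, -0.8), (0.30, 0.05, 0.3)),
            (0.4, (-0.30, -0.30, -1.2), (0.10, 0.10, 0.4))],
    "1/1": [(0.7, (-3.00, -0.60, -0.05), (0.80, 0.20, 0.03)),
            (0.3, (-1.50, -0.30, -0.20), (0.50, 0.10, 0.10))],
}


def sample_gl(genotype, rng):
    """Draw one simulated low-coverage likelihood vector for a true genotype."""
    components = MIXTURES[genotype]
    weights = [w for w, _, _ in components]
    # Pick a mixture component in proportion to its weight.
    _, means, stds = rng.choices(components, weights=weights, k=1)[0]
    # Diagonal-covariance Gaussian draw (a simplifying assumption).
    raw = [rng.gauss(m, s) for m, s in zip(means, stds)]
    # Rescale so the most likely genotype has log10 likelihood 0.
    top = max(raw)
    return tuple(x - top for x in raw)


def simulate_sample(genotypes, seed=0):
    """Simulate likelihood vectors for a list of high-coverage genotype calls."""
    rng = random.Random(seed)
    return [sample_gl(g, rng) for g in genotypes]


if __name__ == "__main__":
    for gt, gl in zip(["0/0", "0/1", "1/1"], simulate_sample(["0/0", "0/1", "1/1"])):
        print(gt, ",".join(f"{x:.3f}" for x in gl))
```

Keeping one mixture per genotype mirrors the three-genotype framework in the text: the high-coverage call selects which mixture to draw from, and the draw stands in for the noisier likelihoods expected at 1× coverage.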
Chen et al. (Wed,) studied this question.