The National Center for Biotechnology Information (NCBI) stores over 1 million bacterial genome sequences, yet no tools are available to estimate the prevalence of specific nucleotide sequences within or across taxa. To address this gap, we developed the Gene Taxonomic Prevalence (GeTPrev) pipeline, which estimates the presence of user-specified genes in bacterial genome collections across taxa. Implemented in Bash, GeTPrev integrates BLAST-based sequence alignment with two operational modes tailored to different analytical needs. The default mode performs a one-pass search against curated complete genome databases formatted for BLAST. A “Heavy” mode expands the search to include both complete and draft genomes for a broader representation of genomic diversity. GeTPrev is managed through a Conda environment and is designed for compatibility with high-performance computing (HPC) systems, enabling efficient batch analysis of large genome datasets. GeTPrev supports the construction of user-defined gene taxonomic targets, and pre-built complete genome databases of seven Enterobacteriaceae genera are included with the pipeline to support rapid analysis. Nine example applications demonstrate the functionality and flexibility of GeTPrev. GeTPrev offers a practical solution for gene-centric analysis in microbial genomics, molecular epidemiology, and food safety surveillance.
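To make the idea of gene prevalence concrete, the following minimal Python sketch shows one way a prevalence estimate could be derived from BLAST tabular output (the standard 12-column `-outfmt 6` layout). It is an illustration only and not GeTPrev's implementation, which is written in Bash. The function name `gene_prevalence`, the contig-to-genome mapping, the gene and contig identifiers, and the identity and coverage thresholds are all hypothetical assumptions made for this example.

```python
from collections import defaultdict


def gene_prevalence(blast_lines, contig_to_genome, query_lengths, n_genomes,
                    min_identity=90.0, min_coverage=0.8):
    """Fraction of genomes with at least one qualifying hit, per query gene.

    Hypothetical helper: thresholds and inputs are assumptions for this sketch.
    blast_lines: lines in BLAST -outfmt 6 (qseqid sseqid pident length
                 mismatch gapopen qstart qend sstart send evalue bitscore).
    """
    genomes_with_hit = defaultdict(set)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        qseqid, sseqid = fields[0], fields[1]
        pident = float(fields[2])
        qstart, qend = int(fields[6]), int(fields[7])
        # Query coverage of this single alignment (no merging of multiple hits).
        coverage = (abs(qend - qstart) + 1) / query_lengths[qseqid]
        if pident >= min_identity and coverage >= min_coverage:
            # Count each genome once, even if several contigs carry the gene.
            genomes_with_hit[qseqid].add(contig_to_genome[sseqid])
    return {q: len(genomes_with_hit[q]) / n_genomes for q in query_lengths}


# Toy data: four genomes (A-D), one hypothetical 1000-bp query gene "geneX".
hits = [
    "geneX\tcontigA1\t99.5\t1000\t5\t0\t1\t1000\t200\t1199\t0.0\t1800",
    "geneX\tcontigA2\t98.0\t1000\t20\t0\t1\t1000\t50\t1049\t0.0\t1700",
    "geneX\tcontigB1\t85.0\t1000\t150\t0\t1\t1000\t10\t1009\t1e-100\t900",
    "geneX\tcontigC1\t99.0\t400\t4\t0\t1\t400\t10\t409\t1e-150\t700",
]
contig_to_genome = {"contigA1": "A", "contigA2": "A",
                    "contigB1": "B", "contigC1": "C"}
prevalence = gene_prevalence(hits, contig_to_genome, {"geneX": 1000}, n_genomes=4)
print(prevalence)
```

In this toy input, only genome A has hits that pass both thresholds (the hit in B fails on identity, the hit in C on coverage, and D has no hit), so one of four genomes is counted. A real analysis would need to decide how to treat fragmented hits in draft assemblies, which this sketch does not attempt.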
Wu et al. studied this question.