What does this research mean for the field?

The Gene Taxonomic Prevalence (GeTPrev) pipeline provides a scalable, efficient, and practical computational solution for estimating the prevalence of specific genes across large bacterial genome datasets.

What question did this study set out to answer?

The study aims to develop a tool to estimate the prevalence of specific genes across bacterial taxa using genome sequences.

May 29, 2026 | Open Access

The Gene Taxonomic Prevalence (GeTPrev) Pipeline for Scalable Gene Prevalence Estimation Across Bacterial Taxa

Key Points

The study aimed to develop a tool that estimates the prevalence of specific genes across bacterial taxa from genome sequences.
The authors developed GeTPrev, a pipeline implemented in Bash, for gene prevalence estimation.
GeTPrev integrates BLAST-based sequence alignment with two operational modes: a default mode for curated complete genomes and a 'Heavy' mode for both complete and draft genomes.
GeTPrev is managed through a Conda environment and designed for compatibility with high-performance computing (HPC) systems.
GeTPrev estimates the presence of user-specified genes in large bacterial genome collections.
Pre-built complete genome databases for seven Enterobacteriaceae genera are included to support rapid analysis.
Nine example applications demonstrate the pipeline's functionality and flexibility across microbial genomics.

Abstract

The National Center for Biotechnology Information (NCBI) stores over 1 million bacterial genome sequences, yet no tools are available to estimate the prevalence of specific nucleotide sequences within or across taxa. To address this gap, we developed the Gene Taxonomic Prevalence (GeTPrev) pipeline. GeTPrev estimates the presence of user-specified genes in bacterial genome collections across taxa. Implemented in Bash, GeTPrev integrates BLAST-based sequence alignment with two operational modes tailored to different analytical needs. The default mode performs a one-pass search against curated complete genome databases formatted for BLAST. The “Heavy” mode expands the search to include both complete and draft genomes, giving a broader representation of genomic diversity. GeTPrev is managed through a Conda environment and designed for compatibility with high-performance computing (HPC) systems, enabling efficient batch analysis of large genome datasets. GeTPrev supports the construction of user-defined gene taxonomic targets, and pre-built complete genome databases of seven Enterobacteriaceae genera are included with the pipeline to support rapid analysis. The functionality and flexibility of GeTPrev were demonstrated by nine example applications. GeTPrev offers a practical solution for gene-centric analysis in microbial genomics, molecular epidemiology, and food safety surveillance.
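
To make the idea of BLAST-based prevalence estimation concrete, the following is a minimal sketch in Python and not GeTPrev's actual code (the pipeline itself is implemented in Bash). It assumes hits are available in BLAST's standard 12-column tabular format (-outfmt 6). Everything else is a hypothetical choice made for illustration: the function name gene_prevalence, the "genome|contig" subject ID convention, the genome-to-taxon lookup, and the 90% identity and 80% query coverage thresholds. Prevalence is taken here as the fraction of genomes in a taxon with at least one qualifying hit.

```python
from collections import defaultdict


def gene_prevalence(blast_lines, genome_taxon, query_lengths,
                    min_identity=90.0, min_coverage=0.8):
    """Return {(gene, taxon): fraction of genomes with a qualifying hit}.

    blast_lines   : iterable of BLAST tabular (-outfmt 6) lines
    genome_taxon  : hypothetical lookup, genome ID -> taxon name
    query_lengths : query gene ID -> gene length in bases
    Thresholds are illustrative assumptions, not GeTPrev defaults.
    """
    # Denominator: number of genomes searched in each taxon.
    totals = defaultdict(int)
    for taxon in genome_taxon.values():
        totals[taxon] += 1

    # Numerator: distinct genomes with at least one qualifying hit.
    positives = defaultdict(set)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        qseqid, sseqid = fields[0], fields[1]
        pident, length = float(fields[2]), int(fields[3])
        # Assumption: subject IDs look like "genome|contig".
        genome = sseqid.split("|")[0]
        if genome not in genome_taxon or qseqid not in query_lengths:
            continue
        # Approximate query coverage from the alignment length.
        coverage = length / query_lengths[qseqid]
        if pident >= min_identity and coverage >= min_coverage:
            positives[(qseqid, genome_taxon[genome])].add(genome)

    return {
        (gene, taxon): len(positives[(gene, taxon)]) / n
        for gene in query_lengths
        for taxon, n in totals.items()
    }


# Made-up example data: five genomes in two genera, one query gene.
EXAMPLE_GENOME_TAXON = {
    "G1": "Salmonella", "G2": "Salmonella",
    "G3": "Escherichia", "G4": "Escherichia", "G5": "Escherichia",
}
EXAMPLE_QUERY_LENGTHS = {"geneA": 1000}
EXAMPLE_HITS = [
    "geneA\tG1|contig1\t99.0\t1000\t10\t0\t1\t1000\t5\t1004\t0.0\t1800",
    "geneA\tG1|contig2\t95.0\t900\t45\t0\t1\t900\t1\t900\t0.0\t1500",
    "geneA\tG2|contig1\t80.0\t1000\t200\t0\t1\t1000\t1\t1000\t1e-90\t700",
    "geneA\tG3|contig1\t98.5\t500\t8\t0\t1\t500\t1\t500\t1e-120\t900",
    "geneA\tG4|contig1\t92.0\t850\t68\t0\t1\t850\t1\t850\t0.0\t1300",
]

if __name__ == "__main__":
    result = gene_prevalence(EXAMPLE_HITS, EXAMPLE_GENOME_TAXON,
                             EXAMPLE_QUERY_LENGTHS)
    for (gene, taxon), fraction in sorted(result.items()):
        print(f"{gene}\t{taxon}\t{fraction:.2f}")
```

Counting each genome once, however many contigs carry a hit, keeps fragmented draft assemblies from inflating the estimate. That matters most in a setting like the “Heavy” mode, where draft genomes are included alongside complete ones.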
