Abstract. Microbial community profiling relies on comprehensive reference databases, yet full-length 16S rRNA amplicons remain sparse for many bacterial taxa. We present SGenerator, a neural network-based data augmentation method that generates biologically informative, full-length (1500 bp) 16S rRNA sequences for underrepresented genera. Combining ideas from time series forecasting and natural language processing, SGenerator uses an LSTM architecture with a sliding-window approach and n-gram segmentation to generate full-length amplicons. The model was trained on 2,289 sequences from 50 genera with unbalanced representation, drawn from a total of 184,732 high-quality sequences in the RiboGrove database, and produced 500 synthetic sequences for each of the 50 genera. BLASTn validation showed that, on average, 300 sequences per genus closely matched native entries, and R2DT analysis confirmed that, on average, 244 per genus folded into canonical 16S rRNA secondary structures, indicating strong biological fidelity. Classifiers trained on the augmented datasets achieved F1 and MCC scores of 0.90 on ITGDB and 0.75 on the more specialized MiDAS dataset, with k-mer embeddings slightly outperforming transformer-based representations. These results demonstrate that LSTM-driven sequence generation can effectively fill taxonomic gaps in full-length amplicon databases and overcome hypervariable region biases in short-read data, and that it has the potential to enhance microbial profiling accuracy in ecological studies.
Fernández et al. (Wed,) studied this question.
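The n-gram segmentation and sliding-window framing named in the abstract can be pictured with a small sketch. This is an illustration under assumptions, not the authors' implementation: the function names `kmer_tokens` and `sliding_windows`, the n-gram size of 3, the stride of 1, and the window length of 4 are all hypothetical choices, since the abstract does not state the values SGenerator uses. The idea is that a sequence is cut into overlapping n-grams, and each run of consecutive n-grams is paired with the n-gram that follows it, which gives next-token training pairs of the kind a forecasting-style LSTM could learn from.

```python
from typing import List, Tuple


def kmer_tokens(seq: str, n: int = 3, stride: int = 1) -> List[str]:
    """Split a nucleotide sequence into overlapping n-grams.

    n and stride are illustrative defaults, not values from the paper.
    """
    seq = seq.upper()
    return [seq[i:i + n] for i in range(0, len(seq) - n + 1, stride)]


def sliding_windows(tokens: List[str], window: int) -> List[Tuple[List[str], str]]:
    """Pair each run of `window` tokens with the token that follows it."""
    return [
        (tokens[i:i + window], tokens[i + window])
        for i in range(len(tokens) - window)
    ]


# Toy 10-nt fragment; a real full-length 16S amplicon is about 1500 bp.
tokens = kmer_tokens("ACGTACGTAC", n=3)
pairs = sliding_windows(tokens, window=4)

for context, target in pairs:
    print(context, "->", target)
```

In a full pipeline the n-grams would be mapped to integer indices and the (context, target) pairs fed to the LSTM; generation would then proceed by repeatedly predicting the next n-gram and sliding the window forward until the sequence reaches the desired length.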