What question did this study set out to answer?

The aim is to generate full-length 16S rRNA sequences for underrepresented microbial taxa to enhance community profiling.

April 25, 2026 · Open Access

Bridging taxonomic gaps in microbial community profiling with LSTM-generated synthetic full-length 16S rRNA sequences

Key Points

Developed SGenerator, a neural network-based tool using LSTM architecture for data augmentation.
Trained on 2,289 sequences from 50 genera, producing 500 synthetic sequences per genus.
Utilized BLASTn and R2DT analyses to validate generated sequences against native entries.
BLASTn validation showed that an average of 300 generated sequences per genus closely matched native entries.
R2DT analysis confirmed that an average of 244 sequences per genus folded into canonical 16S rRNA secondary structures.
Classifiers trained on augmented data achieved F1 and MCC scores of 0.90 on ITGDB and 0.75 on MiDAS.
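
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how F1 and the Matthews correlation coefficient (MCC) are computed from confusion-matrix counts. It covers the binary case only; how the study aggregates scores across genera is not stated in this summary, and the counts below are hypothetical values chosen for illustration, not results from the paper.

```python
import math


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts for one genus-vs-rest decision (not from the study).
tp, tn, fp, fn = 8, 9, 1, 2
print(f"F1  = {f1_score(tp, fp, fn):.3f}")
print(f"MCC = {mcc(tp, tn, fp, fn):.3f}")
```

Unlike F1, MCC also accounts for true negatives, which is one reason the two are often reported together on class-imbalanced data.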

Abstract

Microbial community profiling relies on comprehensive reference databases, yet full-length 16S rRNA amplicons remain sparse for many bacterial taxa. We present SGenerator, a neural network-based data augmentation method that generates biologically informative, full-length (1500 bp) 16S rRNA sequences for underrepresented genera. Combining ideas from time series forecasting and natural language processing, SGenerator uses an LSTM architecture with a sliding-window approach and n-gram segmentation to generate full-length amplicons. It was trained on a subset of 2,289 sequences from 50 genera with unbalanced representation, drawn from a total of 184,732 high-quality sequences in the RiboGrove database, and produced 500 synthetic sequences for each of the 50 genera. BLASTn validation showed that an average of 300 sequences per genus closely matched native entries, and R2DT analysis confirmed that an average of 244 per genus folded into canonical 16S rRNA secondary structures, indicating strong biological fidelity. Classifiers trained on the augmented datasets achieved F1 and MCC scores of 0.90 on ITGDB and 0.75 on the more specialized MiDAS dataset, with k-mer embeddings slightly outperforming transformer-based representations. These results demonstrate that LSTM-driven sequence generation can effectively fill taxonomic gaps in full-length amplicon databases and overcome hypervariable region biases in short-read data, and that it has the potential to enhance microbial profiling accuracy in ecological studies.
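
The n-gram segmentation and sliding-window steps mentioned in the abstract can be illustrated with a minimal sketch. It shows one common way to turn a nucleotide sequence into (context, next-token) pairs suitable for training a next-token model such as an LSTM. The function names, the overlapping stride-1 n-grams, and the values `n=3` and `window=4` are assumptions made for illustration; the abstract does not specify SGenerator's actual settings.

```python
from typing import List, Tuple


def segment_ngrams(seq: str, n: int = 3) -> List[str]:
    """Split a sequence into overlapping n-grams with stride 1 (assumed)."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def sliding_windows(
    tokens: List[str], window: int = 4
) -> List[Tuple[List[str], str]]:
    """Pair each run of `window` tokens with the token that follows it."""
    return [
        (tokens[i:i + window], tokens[i + window])
        for i in range(len(tokens) - window)
    ]


# Toy fragment, far shorter than a real full-length 16S rRNA sequence.
tokens = segment_ngrams("ACGTACGTAC", n=3)
pairs = sliding_windows(tokens, window=4)

for context, target in pairs:
    print(context, "->", target)
```

In a setup like this, each context would be integer-encoded and fed to the network, which learns to predict the target token; generation then proceeds by repeatedly predicting the next token and sliding the window forward.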
