Abstract

In the global effort to characterize biodiversity, short species-specific genomic sequences known as DNA barcodes enable fine-grained comparisons among organisms within the same kingdom of life. Although machine learning algorithms specifically designed for the analysis of DNA barcodes are becoming more popular, most existing methodologies rely on generic supervised training algorithms. We introduce BarcodeBERT, a family of models tailored to biodiversity analysis and trained exclusively on data from a reference library of 1.5 M invertebrate DNA barcodes. We evaluate BarcodeBERT on taxonomic identification tasks against a spectrum of machine learning approaches, including supervised training of classical neural architectures and fine-tuning of general DNA foundation models. Our self-supervised pretraining strategies on domain-specific data outperform fine-tuned foundation models, especially in identification tasks involving lower taxa such as genera and species. Compared with BLAST, a widely used sequence-search tool, BarcodeBERT achieves comparable species-level classification accuracy while being 55 times faster. Our analysis of masking and tokenization strategies also provides practical guidance for building customized DNA language models, emphasizing the importance of aligning model training strategies with dataset characteristics and domain knowledge. The code repository is available at https://github.com/bioscan-ml/BarcodeBERT.
Arias et al. (Thu,) studied this question.
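
The abstract refers to masking and tokenization strategies without spelling them out. As a rough illustration only, the sketch below shows one common way such a pipeline can look: splitting a DNA sequence into k-mer tokens and hiding a fraction of them for a masked-token objective. The function names (`kmer_tokenize`, `mask_tokens`), the choice of `k=4`, the 50% mask rate, and the `[MASK]` symbol are all assumptions made for this example and are not taken from the BarcodeBERT repository.

```python
import random


def kmer_tokenize(seq, k=4, stride=None):
    """Split a DNA sequence into k-mers.

    stride=k (the default) gives non-overlapping tokens; stride=1 gives
    fully overlapping tokens. Trailing bases that do not fill a complete
    k-mer are dropped in this simplified sketch.
    """
    stride = k if stride is None else stride
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def mask_tokens(tokens, mask_rate=0.5, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with a mask symbol.

    Returns the masked token list and a dict mapping each masked
    position to its original token (the prediction targets).
    The mask rate and mask symbol are illustrative assumptions.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))
    masked_idx = set(rng.sample(range(len(tokens)), n_mask))
    masked = [mask_token if i in masked_idx else t
              for i, t in enumerate(tokens)]
    labels = {i: tokens[i] for i in masked_idx}
    return masked, labels


if __name__ == "__main__":
    # Toy sequence, not a real barcode.
    tokens = kmer_tokenize("ACGTTGCAACGTTTGA", k=4)
    masked, labels = mask_tokens(tokens, mask_rate=0.5)
    print(tokens)
    print(masked)
    print(labels)
```

Changing `stride` and `mask_rate` is the kind of knob a tokenization and masking comparison would vary; the values used here are placeholders rather than recommended settings.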