What question did this study set out to answer?

To develop a reproducible framework for preprocessing large-scale language model data, specifically for RoBERTa.

February 28, 2026 · Open Access

From raw text to fairseq RoBERTa: A modular snakemake-based framework enabling language-specific BPE tokenization

Key Points

Developed modular Snakemake workflows for preprocessing tasks.
Implemented filtering and GPT-2 BPE tokenization methods.
Designed fairseq-compatible data generation flows for training models.
Integrated support for high-performance computing (HPC) parallelism and Cloud TPUs.
The framework was used to preprocess data for, and train, language-specific models such as GottBERT and GeistBERT.
Provided high-precision corpus filtering and language-specific BPE tokenization.
Provided utilities for converting trained models to the Huggingface format.
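
To make "language-specific BPE" concrete, the following is a toy sketch of how byte-pair-encoding merges can be learned from a corpus in the target language. It is an illustration only: it works on characters rather than bytes, omits the pre-tokenization used by GPT-2 BPE, and is not the implementation used by the framework. The function names (`merge_pair`, `learn_bpe_merges`, `apply_bpe`) and the sample words are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from an iterable of words."""
    # Each distinct word starts as a tuple of single characters, with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite the vocabulary with the new merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Tokenize `word` by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    corpus = ["über", "über", "übel", "lieber"]  # hypothetical German sample
    rules = learn_bpe_merges(corpus, num_merges=3)
    print(rules)
    print(apply_bpe("lieber", rules))
```

Because the merges are learned from the target-language corpus, frequent character sequences of that language (here, sequences around "ü") end up as single tokens, which is the motivation for training a tokenizer per language instead of reusing an English vocabulary.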

Abstract

Large-scale language model training requires robust and reproducible data preprocessing. While fairseq provides efficient training routines for RoBERTa models, preparing high-quality, language-specific data remains complex. We present modular Snakemake-based workflows for large-scale language model preparation, covering filtering, GPT-2 BPE tokenization, and fairseq-compatible data generation. The pipelines support new and existing tokenizers, enable scalable HPC parallelism, and include utilities for converting trained models to the Huggingface format. Bundled with a fairseq fork supporting GPU clusters and Cloud TPUs, the framework has been used to train GottBERT, GeistBERT, ChristBERT, PortBERT, SindBERT, and HalleluBERT, and generalizes into a reusable preprocessing infrastructure.

• Modular Snakemake framework for large-scale RoBERTa pre-processing.
• Supports high-precision corpus filtering and language-specific BPE.
• Includes a fairseq fork with TPU v3/v4 support and Whole Word Masking (Huggingface tokenizers on GPU).
• Provides utilities for Huggingface conversion and log monitoring.
• Applied to pre-process and train GottBERT, GeistBERT, ChristBERT, PortBERT, SindBERT and HalleluBERT.
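
The tokenization and fairseq-compatible data generation steps named in the abstract can be sketched as two command-line stages. The snippet below only builds the argument lists and does not run anything. It assumes the commands from the upstream fairseq RoBERTa pretraining tutorial (`examples.roberta.multiprocessing_bpe_encoder` and `fairseq-preprocess`); the Snakemake rules and the fairseq fork described in the paper may invoke them differently. All paths, the `train`/`valid`/`test` split names, and the helper names are hypothetical.

```python
import shlex

SPLITS = ("train", "valid", "test")  # assumed split names


def bpe_encode_cmd(data_dir, split, encoder_json, vocab_bpe, workers=8):
    """Command that BPE-encodes one raw text split (run from a fairseq checkout)."""
    return [
        "python", "-m", "examples.roberta.multiprocessing_bpe_encoder",
        "--encoder-json", encoder_json,
        "--vocab-bpe", vocab_bpe,
        "--inputs", f"{data_dir}/{split}.raw",
        "--outputs", f"{data_dir}/{split}.bpe",
        "--keep-empty",  # keep empty lines, which mark document boundaries
        "--workers", str(workers),
    ]


def binarize_cmd(data_dir, dict_path, dest_dir, workers=8):
    """Command that binarizes the encoded splits into fairseq's data format."""
    return [
        "fairseq-preprocess", "--only-source",
        "--srcdict", dict_path,
        "--trainpref", f"{data_dir}/train.bpe",
        "--validpref", f"{data_dir}/valid.bpe",
        "--testpref", f"{data_dir}/test.bpe",
        "--destdir", dest_dir,
        "--workers", str(workers),
    ]


def build_commands(data_dir, encoder_json, vocab_bpe, dict_path, dest_dir, workers=8):
    """Return the commands in dependency order: encode each split, then binarize."""
    cmds = [
        bpe_encode_cmd(data_dir, split, encoder_json, vocab_bpe, workers)
        for split in SPLITS
    ]
    cmds.append(binarize_cmd(data_dir, dict_path, dest_dir, workers))
    return cmds


if __name__ == "__main__":
    for cmd in build_commands(
        "data", "bpe/encoder.json", "bpe/vocab.bpe", "bpe/dict.txt", "data-bin"
    ):
        print(shlex.join(cmd))
```

In a workflow manager such as Snakemake, each of these commands would typically become its own rule with declared inputs and outputs, so that the three encoding jobs can run in parallel on a cluster and the binarization step starts only once all of them have finished.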

Cite This Study

Raphael Schmitt studied this question.

synapsesocial.com/papers/69a285da0a974eb0d3c00bb5 https://doi.org/10.1016/j.simpa.2026.100824
