Large-scale language model training requires robust and reproducible data preprocessing. While fairseq provides efficient training routines for RoBERTa models, preparing high-quality, language-specific data remains complex. We present modular Snakemake-based workflows for large-scale language model preparation, covering filtering, GPT-2 BPE tokenization, and fairseq-compatible data generation. The pipelines support new and existing tokenizers, enable scalable HPC parallelism, and include utilities for converting trained models to the Huggingface format. Bundled with a fairseq fork supporting GPU clusters and Cloud TPUs, the framework has been used to train GottBERT, GeistBERT, ChristBERT, PortBERT, SindBERT, and HalleluBERT, and generalizes into a reusable preprocessing infrastructure.

• Modular Snakemake framework for large-scale RoBERTa pre-processing.
• Supports high-precision corpus filtering and language-specific BPE.
• Includes a fairseq fork with TPU v3/v4 support and Whole Word Masking (Huggingface tokenizers on GPU).
• Provides utilities for Huggingface conversion and log monitoring.
• Applied to pre-process and train GottBERT, GeistBERT, ChristBERT, PortBERT, SindBERT and HalleluBERT.
Raphael Schmitt (Sun,) studied this question.