High-throughput sequencing datasets frequently exhibit extreme read depth variation, biasing downstream analysis. Normalising coverage to a specific depth cap is important, yet existing tools rely on computationally expensive fetch-based or non-deterministic greedy algorithms. Here, we present a new coordinate-sorted sweep-line algorithm implemented in the open-source software rasusa that enforces a strict coverage cap at every genomic position. By utilising seeded random priority assignment, we achieve unbiased, reproducible read selection. The algorithm reduces runtimes by over 1,400-fold compared to legacy fetch-based methods—slashing processing from hours to mere seconds—and operates roughly four times faster than VariantBam. Furthermore, it requires only 8 MB of memory for long-read data. This provides a highly efficient, scalable, and reproducible solution for sequencing coverage normalisation.
Furqon et al. (Fri,) studied this question.