Minimizer sketches summarize sequences compactly so that many reads can be compared with a small footprint; they are built by selecting the lexicographically smallest k-mer in each window. Building on Lyndon factorization and on fingerprints, a compact representation of a sequence given by the lengths of the factors in its factorization, we extend the notion of minimizer sketches to read fingerprints. By leveraging the conservation property of Lyndon factorization, we propose a novel approach for the fast comparison of long reads, aimed at detecting overlapping read pairs. An experimental evaluation of assemblies produced from the overlaps computed by our approach shows that it is competitive with the state-of-the-art tool minimap2 in terms of quality, while being up to 5 times faster at higher coverage levels.
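To make the notions of Lyndon factorization, fingerprint, and minimizer concrete, the following is a minimal Python sketch. It is illustrative only and is not the implementation evaluated here: the function names `lyndon_factorization`, `fingerprint`, and `minimizers` are hypothetical, the factorization uses the standard Duval algorithm, and applying minimizers to windows of k consecutive fingerprint entries, ordered as tuples of integers with leftmost tie-breaking, is an assumption about how the extension could be realized.

```python
def lyndon_factorization(s):
    """Duval's algorithm: factor s into a non-increasing sequence of Lyndon words."""
    factors = []
    n = len(s)
    i = 0
    while i < n:
        j, k = i + 1, i
        # Extend the current run while it stays a prefix of a power of a Lyndon word.
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # Emit every complete copy of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def fingerprint(s):
    """Fingerprint of s: the lengths of the factors in its Lyndon factorization."""
    return [len(f) for f in lyndon_factorization(s)]


def minimizers(items, k, w):
    """Return (position, k-mer) pairs: the smallest k-mer in each window of w
    consecutive k-mers, keeping the leftmost on ties and dropping repeats.
    Works on any sequence with comparable items (characters or integers)."""
    kmers = [tuple(items[i:i + k]) for i in range(len(items) - k + 1)]
    picked = []
    for start in range(len(kmers) - w + 1):
        offset = min(range(w), key=lambda o: kmers[start + o])
        pos = start + offset
        if not picked or picked[-1][0] != pos:
            picked.append((pos, kmers[pos]))
    return picked


if __name__ == "__main__":
    read = "GATTACA"
    print(lyndon_factorization(read))
    print(fingerprint(read))
    print(minimizers(fingerprint(read), k=2, w=2))
```

Because `minimizers` only relies on item comparison, the same function can sketch either the nucleotide string itself or its fingerprint; the fingerprint is shorter than the read, which is what makes sketching it attractive for long reads.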
Masri et al. (Mon,) studied this question.