Many state-of-the-art tools for sequence analysis rely on alignment-free techniques to handle high-throughput data. Several routine tasks, such as querying, indexing, and similarity search, are based on k-mer statistics. To accommodate errors or mutations, spaced seeds have increasingly been used instead of k-mers, enhancing sensitivity in various applications. However, spaced seed hashing is computationally intensive and significantly slows down processing. This article addresses the challenge of efficient spaced seed hashing, which is instrumental to spaced k-mer counting. We present DuoHash, a framework that enables the efficient computation of hash functions for spaced seeds. DuoHash exploits an efficient binary encoding of spaced seeds and precomputed tables to speed up the computation of the hash value for both the forward and reverse strands of a DNA sequence. In our experiments, DuoHash substantially outperforms existing algorithms, achieving speedups of up to 11x on short reads with a spaced seed of medium density. Furthermore, we show the applicability of DuoHash to the problem of spaced k-mer counting. The code of DuoHash is available at https://github.com/CominLab/DuoHash/ .
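To make the cost of spaced seed hashing concrete, the following minimal Python sketch computes the hash of every window of a sequence on both strands in the naive way, by re-reading each selected position for each window. It is not the DuoHash algorithm and does not use its precomputed tables; it only illustrates the baseline that such methods aim to accelerate. The 2-bit encoding (A=0, C=1, G=2, T=3), the representation of the seed as a string of `1` (care) and `0` (don't care) characters, and the function names `spaced_seed_hash` and `hash_both_strands` are assumptions made for illustration.

```python
# Assumed 2-bit nucleotide encoding; other conventions are equally valid.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def spaced_seed_hash(window, seed):
    """Pack the bases at the seed's '1' positions into an integer, 2 bits each.

    `seed` is a string such as "1101"; positions marked '0' are skipped.
    """
    if len(window) != len(seed):
        raise ValueError("window and seed must have the same length")
    value = 0
    for base, flag in zip(window, seed):
        if flag == "1":
            value = (value << 2) | ENCODE[base]
    return value


def hash_both_strands(seq, seed):
    """Naively hash every window of `seq` on the forward and reverse strands.

    Each window costs time proportional to the seed length, because nothing
    is reused between neighbouring windows.
    """
    span = len(seed)
    hashes = []
    for start in range(len(seq) - span + 1):
        window = seq[start:start + span]
        forward = spaced_seed_hash(window, seed)
        reverse = spaced_seed_hash(reverse_complement(window), seed)
        hashes.append((forward, reverse))
    return hashes
```

Calling `hash_both_strands("ACGTAC", "1101")` returns one `(forward, reverse)` pair per window. With contiguous k-mers, consecutive hashes can be updated by shifting in a single base; with spaced seeds the don't-care positions break that simple update, which is why the naive loop above recomputes each window from scratch.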
Gemin et al. (Sat,) studied this question.