Reconstructing a genome from collections of short DNA fragments is a fundamental problem in modern sequencing. Although genome assembly algorithms are widely used in practice, the mathematical conditions that allow exact reconstruction are not always clear. This study develops a graph-theoretic framework for genome reconstruction using De Bruijn graphs and Eulerian paths in an idealized, error-free setting. Each k-mer is represented as a directed edge connecting its (k−1)-length prefix and suffix. The resulting overlap graph is constructed using a balanced search tree and traversed with a stack-based Eulerian algorithm. Numerical experiments over a broad range of genome lengths and fragment lengths reveal a sharp transition in reconstruction accuracy. This transition is explained by a probabilistic model for prefix collisions in the directed graph. The theoretical predictions agree with simulation results and provide conditions on the fragment length required for reliable reconstruction. These results show that the difficulty of genome assembly is governed primarily by the combinatorial structure of the underlying graph rather than by algorithmic heuristics.
Zhu et al. (Sat,) studied this question.