October 20, 2025 · Open Access

On the Complexity of Finding Approximate LCS of Multiple Strings

Key Points

Efficient algorithms are given for restricted variants of the approximate longest common substring problem under Hamming and edit distances.
The algorithms run in $O(N^2)$, $O(k \ell N^2)$, and $O(mN \log^k \ell)$ time, where $m$ is the number of strings, $k$ is the distance bound, $\ell$ is the length of the longest string, and $N$ is the total length of all strings.
The general problems are NP-hard, which motivates efficient algorithms for restricted cases.
The study also establishes conditional lower bounds under the Strong Exponential Time Hypothesis and extends the analysis to indeterminate strings.

Abstract

Finding an Approximate Longest Common Substring (ALCS) within a given set S = {s₁, s₂, …, sₘ} of m ≥ 2 strings is a key problem in computational biology, with applications such as identifying related mutations across multiple genetic sequences. We study several variants of the ALCS problem that, given integers k and t ≤ m, seek the longest string u, or the longest substring u of any string in S, that lies within distance k of at least one substring in each of t distinct strings from S. While the general problems are NP-hard, we present efficient algorithms for restricted cases under Hamming and edit distances using the LCPₖ and k-errata tree data structures. Our methods achieve run times of O(N²), O(kℓN²), and O(mN logᵏ ℓ), where ℓ is the length of the longest string and N is the sum of the lengths of all the strings in S. We also establish conditional lower bounds under the Strong Exponential Time Hypothesis and extend our study to indeterminate strings.
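
To make the problem statement concrete, the following is a naive brute-force sketch of the restricted variant in which u must be a substring of some string in S and distance is Hamming distance. It is not the paper's algorithm and uses neither LCPₖ nor k-errata trees. The function names (`hamming`, `occurs_within_k`, `naive_alcs`) are hypothetical, and the sketch assumes that the string containing u counts toward the t matching strings.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within_k(u: str, s: str, k: int) -> bool:
    """True if some length-|u| substring of s is within Hamming distance k of u."""
    n = len(u)
    # If s is shorter than u the range is empty and the result is False.
    return any(hamming(u, s[i:i + n]) <= k for i in range(len(s) - n + 1))


def naive_alcs(strings: list[str], k: int, t: int) -> str:
    """Longest substring u of some string in `strings` that is within Hamming
    distance k of a substring in at least t distinct strings.

    Assumption: the string that u is taken from counts as one of the t.
    Returns the empty string if no candidate qualifies.
    """
    longest = max(map(len, strings), default=0)
    for length in range(longest, 0, -1):  # try the longest candidates first
        for s in strings:
            for i in range(len(s) - length + 1):
                u = s[i:i + length]
                hits = sum(occurs_within_k(u, other, k) for other in strings)
                if hits >= t:
                    return u
    return ""


if __name__ == "__main__":
    S = ["abcde", "xbcdz", "qqbxd"]
    print(naive_alcs(S, k=1, t=3))  # bcd
    print(naive_alcs(S, k=1, t=2))  # abcd
```

This sketch examines about Nℓ candidate substrings and spends up to about Nℓ character comparisons on each, so it takes roughly O(N²ℓ²) time. It is useful only as a reference definition against which faster methods can be checked.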
