June 29, 2009

Incremental maintenance of length normalized indexes for approximate string matching

Key Points

Key points are not available for this paper at this time.

Abstract

Approximate string matching is a problem that has received a lot of attention recently. Existing work on information retrieval has concentrated on a variety of similarity measures TF/IDF, BM25, HMM, etc.) specifically tailored for document retrieval purposes. As new applications that depend on retrieving short strings are becoming popular(e.g., local search engines like YellowPages.com, Yahoo!Local, and Google Maps) new indexing methods are needed, tailored for short strings. For that purpose, a number of indexing techniques and related algorithms have been proposed based on length normalized similarity measures. A common denominator of indexes for length normalized measures is that maintaining the underlying structures in the presence of incremental updates is inefficient, mainly due to data dependent, precomputed weights associated with each distinct token and string. Incorporating updates usually is accomplished by rebuilding the indexes at regular time intervals. In this paper we present a framework that advocates lazy update propagation with the following key feature: Efficient, incremental updates that immediately reflect the new data in the indexes in a way that gives strict guarantees on the quality of subsequent query answers. More specifically, our techniques guarantee against false negatives and limit the number of false positives produced. We implement a fully working prototype and illustrate that the proposed ideas work really well in practice for real datasets.

Connected Papers

Building similarity graph...

Analyzing shared references across papers

Discussion

Authors

Marios Hadjieleftheriou

AT&T (United States)

Nick Koudas

AT&T (United States)

Divesh Srivastava

AT&T (United States)

Actions

Institutions

University of Toronto

AT&T (United States)

References and Citations

Connected Papers

Building similarity graph...

Analyzing shared references across papers

Incremental maintenance of length normalized indexes for approximate string matching

Key Points

Abstract

Citation Network

Connected Papers

Discussion

Authors

Actions

Institutions

References and Citations

Citation Network

Connected Papers

Discussion

Cite this study