Similarity join is a useful primitive operation underlying many applications, such as near-duplicate Web page detection, data integration, and pattern recognition. Traditional similarity joins require a user to specify a similarity threshold. In this paper, we study a variant of the similarity join, termed top-k set similarity join. It returns the top-k pairs of records ranked by their similarities, thus eliminating the guesswork users have to perform when the similarity threshold is unknown beforehand. An algorithm, topk-join, is proposed to answer top-k similarity joins efficiently. It is based on the prefix filtering principle and employs tight upper bounding of the similarity values of unseen pairs. Experimental results demonstrate the efficiency of the proposed algorithm on large-scale real datasets.
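To make the problem statement concrete, the following is a minimal baseline sketch of a top-k set similarity join. It is not the paper's topk-join algorithm: it enumerates all pairs rather than using prefix filtering, and the only pruning it applies is the standard size-based upper bound on Jaccard similarity. The choice of Jaccard as the similarity function, the function names `jaccard` and `topk_join_naive`, and the sample records are all assumptions made for illustration.

```python
import heapq
from itertools import combinations


def jaccard(x, y):
    """Jaccard similarity of two sets: |x & y| / |x | y|."""
    if not x and not y:
        return 0.0
    return len(x & y) / len(x | y)


def topk_join_naive(records, k):
    """Return the k most similar record pairs as (similarity, i, j), best first.

    Baseline only: examines every pair, skipping a pair when a cheap
    upper bound shows it cannot beat the current k-th best similarity.
    """
    if k <= 0:
        return []
    sets = [set(r) for r in records]
    heap = []  # min-heap of the best k pairs seen so far
    for i, j in combinations(range(len(sets)), 2):
        x, y = sets[i], sets[j]
        if len(heap) == k:
            # Jaccard can never exceed min(|x|, |y|) / max(|x|, |y|).
            big = max(len(x), len(y))
            bound = min(len(x), len(y)) / big if big else 0.0
            if bound <= heap[0][0]:
                continue
        sim = jaccard(x, y)
        if len(heap) < k:
            heapq.heappush(heap, (sim, i, j))
        elif sim > heap[0][0]:
            heapq.heapreplace(heap, (sim, i, j))
    return sorted(heap, reverse=True)


# Hypothetical sample data: each record is a set of tokens.
records = [["a", "b", "c"], ["a", "b", "c", "d"], ["x", "y"], ["a", "b"]]
result = topk_join_naive(records, 2)
```

No similarity threshold is supplied by the caller; the k-th best similarity found so far acts as a threshold that rises as the scan proceeds, which is the property the abstract highlights.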
Chuan Xiao
Beijing Institute of Technology
Wei Wang
Xihua University
Xuemin Lin
Shanghai Jiao Tong University
Proceedings - International Conference on Data Engineering
UNSW Sydney
Data61
Xiao et al. (Sun,) studied this question.
synapsesocial.com/papers/6a0eadffb7cc3b883f229f19 — DOI: https://doi.org/10.1109/icde.2009.111