A data lake maintains large amounts of heterogeneous data with different data schemas and query interfaces. Efficiently querying and analyzing this heterogeneous data enables users to gain more complete insights. In this paper, we study a novel problem of distributed keyword search across heterogeneous data sources. Traditional distributed search algorithms generally require predefined crossing edges that connect relevant data instances for communication between different sources, which is impractical for a data lake due to schema heterogeneity. To effectively perform keyword search over the data lake, we first introduce canonical graphs and then develop a best-first search algorithm called UnifySea, which explores answers across different sources based on the unified identification of related instances. To further improve query efficiency, we propose a novel incremental keyword search algorithm called DistSea, which only needs to identify the promising relevant data between different sources. DistSea incrementally calculates the optimal answers based on local partial evaluation. Equipped with several efficient pruning rules, DistSea reduces unpromising tree calculations across different sources. Experimental evaluations on six real-world benchmarks demonstrate the effectiveness, efficiency, and scalability of the proposed algorithms.
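To make the idea of best-first keyword search concrete, the following is a minimal sketch of a generic best-first expansion over a single weighted graph. It is not UnifySea or DistSea, whose details the abstract does not give. The function name `best_first_keyword_search`, the `edges` and `labels` inputs, and the scoring rule (an answer is a root node, scored by the sum of its shortest distances to one match of each keyword) are all assumptions made for illustration. The early exit shows one simple pruning rule of the kind the abstract alludes to.

```python
import heapq
from collections import defaultdict


def best_first_keyword_search(edges, labels, keywords):
    """Illustrative only: return (cost, root) for the node whose summed
    shortest distance to one match of every keyword is smallest, or None."""
    # Build an undirected adjacency list from (u, v, weight) triples.
    graph = defaultdict(list)
    for u, v, w in edges:
        graph[u].append((v, w))
        graph[v].append((u, w))

    # dist[k][n] = settled distance from node n to the nearest match of keyword k.
    dist = {k: {} for k in keywords}
    heap = []
    for node, text in labels.items():
        words = text.lower().split()
        for k in keywords:
            if k in words:
                heapq.heappush(heap, (0, k, node))

    best = None
    while heap:
        d, k, node = heapq.heappop(heap)
        # Pruning: entries pop in nondecreasing distance, so any root completed
        # from here on costs at least d and cannot beat the current best.
        if best is not None and d >= best[0]:
            break
        if node in dist[k]:
            continue  # already settled for this keyword
        dist[k][node] = d
        # A node reached from every keyword is a candidate answer root.
        if all(node in dist[q] for q in keywords):
            cost = sum(dist[q][node] for q in keywords)
            if best is None or cost < best[0]:
                best = (cost, node)
        for nxt, w in graph[node]:
            if nxt not in dist[k]:
                heapq.heappush(heap, (d + w, k, nxt))
    return best


# Hypothetical toy data: a hub "c" joins three nodes that each match one keyword.
edges = [("c", "x", 1), ("c", "y", 1), ("c", "z", 1), ("z", "w", 5)]
labels = {"c": "hub", "x": "data lake", "y": "keyword search",
          "z": "graph index", "w": "lake"}
result = best_first_keyword_search(edges, labels, ["lake", "keyword", "graph"])
```

In this toy graph the hub `c` is the cheapest root because it is one hop from a match of each keyword. A distributed setting such as the one the paper targets would additionally need a way to link related instances across sources, which this single-graph sketch does not attempt.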
Qin et al. (Tue,) studied this question.
synapsesocial.com/papers/68f984011881b68f3b7ae5c4 — DOI: https://doi.org/10.1145/3772001
Yuan Qin
Beijing Institute of Technology
Ye Yuan
Beijing Institute of Technology
Zhenyu Wen
Zhejiang University of Technology
ACM Transactions on Office Information Systems