A data lake maintains a large amounts of heterogeneous data with different data schemas and query interfaces. Efficiently querying and analyzing the heterogeneous data enables users to gain more complete insights. In this paper, we study a novel problem of distributed keyword search across heterogeneous data sources. Traditional distributed search algorithms generally require the predefined crossing edges connecting relevant data instances for communication between different sources, which is unpractical for the data lake due to the schema heterogeneity. To effectively perform keyword search over the data lake, we first introduce canonical graphs and then develop a best-first search algorithm called UnifySea, which explores the answers across different sources based on the unified identification of related instances. To further improve the query efficiency, we propose a novel incremental keyword search algorithm called DistSea, which just need to identify the promising relevant data between different sources. DistSea incrementally calculates the optimal answers based on locally partial evaluation. Equipped with several efficient pruning rules, DistSea reduces unpromising tree calculation across different sources. Experimental evaluations on six real-world benchmarks demonstrate the effectiveness, efficiency and scalability of the proposed algorithms.
Qin et al. (Tue,) studied this question.