What type of study is this?

This is an experimental study.

October 23, 2025 · Open Access

Exploring Heterogeneous Data Lake based on Canonical Graphs

Key Points

The paper studies distributed keyword search across the heterogeneous data sources of a data lake, so that users can gain more complete insights.
The approach introduces canonical graphs and two search algorithms, UnifySea and DistSea, to handle diverse data schemas.
Experimental evaluations on six real-world benchmarks demonstrate the effectiveness, efficiency, and scalability of the proposed algorithms.
These findings may enable more effective querying methods in future heterogeneous data lake implementations.

Abstract

A data lake maintains large amounts of heterogeneous data with different data schemas and query interfaces. Efficiently querying and analyzing this heterogeneous data enables users to gain more complete insights. In this paper, we study a novel problem of distributed keyword search across heterogeneous data sources. Traditional distributed search algorithms generally require predefined crossing edges that connect relevant data instances for communication between different sources, which is impractical in a data lake because of schema heterogeneity. To perform keyword search over the data lake effectively, we first introduce canonical graphs and then develop a best-first search algorithm called UnifySea, which explores answers across different sources based on the unified identification of related instances. To further improve query efficiency, we propose a novel incremental keyword search algorithm called DistSea, which only needs to identify the promising relevant data between different sources. DistSea incrementally calculates the optimal answers based on local partial evaluation. Equipped with several efficient pruning rules, DistSea reduces the calculation of unpromising trees across different sources. Experimental evaluations on six real-world benchmarks demonstrate the effectiveness, efficiency, and scalability of the proposed algorithms.
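
The abstract gives no pseudocode, so the following is only a minimal sketch of the general idea of best-first keyword search over a graph merged by unified identifiers. It is not the authors' UnifySea or DistSea. It swaps in a standard backward expansion with distinct-root semantics, where an answer is a root node scored by the sum of its shortest distances to each keyword. The function names, the `canonical_id` mapping, the edge weights, and the toy data are all hypothetical.

```python
import heapq
from collections import defaultdict

def merge_sources(sources, canonical_id):
    """Merge per-source edge lists into one undirected weighted graph.

    Nodes from different sources that map to the same canonical id
    (a hypothetical stand-in for unified identification) are unified.
    """
    graph = defaultdict(dict)
    for edges in sources:
        for u, v, w in edges:
            cu, cv = canonical_id(u), canonical_id(v)
            if cu == cv:
                continue
            best = min(w, graph[cu].get(cv, float("inf")))
            graph[cu][cv] = best
            graph[cv][cu] = best
    return graph

def best_first_keyword_search(graph, labels, keywords):
    """Return (cost, root) for the cheapest root reaching all keywords, else None."""
    # dist[k][n] is the shortest distance from node n to a match of keywords[k].
    dist = [dict() for _ in keywords]
    heap = []
    for k, kw in enumerate(keywords):
        for node, text in labels.items():
            if kw in text.lower().split():
                heapq.heappush(heap, (0.0, k, node))
    best = None
    while heap:
        d, k, node = heapq.heappop(heap)
        # Any root not yet settled for every keyword costs at least d,
        # so the current best cannot be beaten once best cost <= d.
        if best is not None and best[0] <= d:
            break
        if node in dist[k]:
            continue
        dist[k][node] = d
        if all(node in dk for dk in dist):
            total = sum(dk[node] for dk in dist)
            if best is None or total < best[0]:
                best = (total, node)
        for nbr, w in graph.get(node, {}).items():
            if nbr not in dist[k]:
                heapq.heappush(heap, (d + w, k, nbr))
    return best

# Toy data: two sources describe the same paper under different local ids.
ID_MAP = {"db:paper/7": "paper:7", "doc:paper-7": "paper:7",
          "db:author/1": "author:1", "doc:venue/vldb": "venue:vldb"}
relational_source = [("db:author/1", "db:paper/7", 1.0)]
document_source = [("doc:paper-7", "doc:venue/vldb", 1.0)]
labels = {"author:1": "alice smith", "paper:7": "graph search", "venue:vldb": "vldb"}

graph = merge_sources([relational_source, document_source], ID_MAP.get)
answer = best_first_keyword_search(graph, labels, ["alice", "vldb", "graph"])
```

In this toy setup neither source alone connects "alice" to "vldb". The connection appears only after the two local paper identifiers are unified, which is the role the abstract assigns to canonical graphs in place of predefined crossing edges.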
