What type of study is this?

This is an experimental study.

October 23, 2025 | Open Access

Exploring Heterogeneous Data Lake based on Canonical Graphs

Key points

The study addresses distributed keyword search across the heterogeneous data sources of a data lake, so that users can gain more complete insights.
Two algorithms are introduced to handle diverse data schemas: UnifySea, a best-first search over canonical graphs, and DistSea, an incremental search equipped with pruning rules.
Experimental evaluations on six real-world benchmarks demonstrate the effectiveness, efficiency, and scalability of the proposed algorithms.
These findings may enable more effective querying methods in future heterogeneous data lake implementations.

Abstract

A data lake maintains large amounts of heterogeneous data with different data schemas and query interfaces. Efficiently querying and analyzing the heterogeneous data enables users to gain more complete insights. In this paper, we study a novel problem of distributed keyword search across heterogeneous data sources. Traditional distributed search algorithms generally require predefined crossing edges that connect relevant data instances for communication between different sources, which is impractical for the data lake due to schema heterogeneity. To effectively perform keyword search over the data lake, we first introduce canonical graphs and then develop a best-first search algorithm called UnifySea, which explores the answers across different sources based on the unified identification of related instances. To further improve query efficiency, we propose a novel incremental keyword search algorithm called DistSea, which only needs to identify the promising relevant data between different sources. DistSea incrementally calculates the optimal answers based on local partial evaluation. Equipped with several efficient pruning rules, DistSea reduces unpromising tree calculation across different sources. Experimental evaluations on six real-world benchmarks demonstrate the effectiveness, efficiency, and scalability of the proposed algorithms.
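The abstract does not spell out how UnifySea or DistSea work internally, so the following is only a generic illustration of best-first keyword search on a single weighted graph. It is not the authors' method. It assumes a "distinct root" answer model: an answer is a root node that reaches at least one match for every keyword, ranked by the sum of shortest-path distances. The names `keyword_distances`, `best_root`, `graph`, and `labels` are hypothetical, and the toy graph is invented for the example.

```python
import heapq

def keyword_distances(graph, sources):
    """Multi-source Dijkstra: cheapest distance from any source to each node.

    Nodes are expanded in best-first order (smallest distance first).
    """
    dist = {s: 0 for s in sources}
    heap = [(0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def best_root(graph, labels, keywords):
    """Return (cost, root) for the cheapest root that reaches every keyword,
    or None if some keyword has no match or no common root exists."""
    per_keyword = []
    for kw in keywords:
        sources = [n for n, words in labels.items() if kw in words]
        if not sources:
            return None
        per_keyword.append(keyword_distances(graph, sources))
    # A valid root must be reachable from a match of every keyword.
    candidates = set(per_keyword[0])
    for dist in per_keyword[1:]:
        candidates &= set(dist)
    if not candidates:
        return None
    return min((sum(d[n] for d in per_keyword), n) for n in candidates)

# Toy undirected graph (hypothetical data): three labeled nodes around a hub.
graph = {
    "hub": {"x": 1, "y": 1, "z": 1},
    "x": {"hub": 1},
    "y": {"hub": 1},
    "z": {"hub": 1},
}
labels = {"hub": set(), "x": {"graph"}, "y": {"lake"}, "z": {"search"}}

result = best_root(graph, labels, ["graph", "lake", "search"])
```

In this toy graph the hub is the cheapest root, because it is one hop from each keyword match. A distributed setting such as the one the paper targets would additionally need a way to link related instances across sources, which this sketch leaves out.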

Cite this study

Qin et al. studied this question.

synapsesocial.com/papers/68f984011881b68f3b7ae5c4 — DOI: https://doi.org/10.1145/3772001

Authors

Yuan Qin

Beijing Institute of Technology

Ye Yuan

Beijing Institute of Technology

Zhenyu Wen

Zhejiang University of Technology

Journals

ACM Transactions on Office Information Systems

Institutions

Beijing Institute of Technology

Zhejiang University of Technology
