Finding relevant datasets is critical in any data pipeline but becomes challenging when data lacks schemas or metadata, as in data lakes. This makes it hard to identify the joins needed to produce the desired dataset. In query-by-example (QbE) join discovery, users provide a query table with a few example values, aiming to find joins over data lake tables that produce datasets containing those examples. Current QbE methods rely only on syntactic similarity, while semantic join discovery methods do not support QbE interfaces that work with limited example values. Moreover, existing QbE join path discovery methods (1) assume that the matching tables are directly joinable with each other, whereas in practice a join path might contain intermediate tables that do not match the query table; and (2) do not ensure that the example tuples are contained in the returned joined table. We propose SemDisc, an end-to-end join discovery system that (1) discovers hybrid join paths using both equi-joins and semantic joins across data lake tables, (2) produces join paths that may include intermediate tables that do not overlap with the query table but are needed to build high-quality joins, and (3) ensures the returned tuples are semantically similar to the ones in the provided examples. SemDisc supports efficient querying of joinable tables using an index that keeps track of high-quality join paths. Our evaluation across diverse workloads and datasets shows that SemDisc yields an average precision of over 0.86 in finding the correct join paths across various benchmarks, which is more than a 3x improvement over state-of-the-art join discovery methods.
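To make the problem setting concrete, the following is a minimal sketch of QbE join path discovery on a toy data lake. It is not SemDisc's algorithm: it assumes exact value equality where the abstract describes semantic similarity, and it enumerates paths by brute force where the abstract describes an index. The table names, column names, and helper functions (`equi_join`, `contains_examples`, `discover_paths`) are all hypothetical. It illustrates only the two requirements named above: the path may pass through an intermediate table that shares nothing with the query table, and every example tuple must appear in the joined result.

```python
from itertools import permutations

# Toy "data lake": table name -> list of rows (dicts). All values are made up.
LAKE = {
    "employees": [{"emp": "Ana", "dept_id": "d1"}, {"emp": "Bo", "dept_id": "d2"}],
    "departments": [{"dept_id": "d1", "site_id": "s1"}, {"dept_id": "d2", "site_id": "s2"}],
    "sites": [{"site_id": "s1", "city": "Oslo"}, {"site_id": "s2", "city": "Lima"}],
}

def equi_join(left, right):
    """Natural equi-join on shared column names; empty if nothing is shared."""
    shared = set(left[0]) & set(right[0]) if left and right else set()
    if not shared:
        return []
    return [{**l, **r} for l in left for r in right
            if all(l[c] == r[c] for c in shared)]

def contains_examples(rows, examples):
    """True if every example tuple's values all occur in some joined row."""
    return all(any(set(ex) <= set(row.values()) for row in rows)
               for ex in examples)

def discover_paths(lake, examples, max_len=3):
    """Brute-force search for join paths whose result contains all examples."""
    found, seen = [], set()
    for k in range(1, max_len + 1):
        for path in permutations(lake, k):
            if frozenset(path) in seen:
                continue  # same set of tables already reported in another order
            rows = lake[path[0]]
            for name in path[1:]:
                rows = equi_join(rows, lake[name])
            if rows and contains_examples(rows, examples):
                seen.add(frozenset(path))
                found.append(path)
    return found

# Query table given by example: (employee, city) pairs.
examples = [("Ana", "Oslo"), ("Bo", "Lima")]
paths = discover_paths(LAKE, examples)
print(paths)
```

In this toy setup no single table or pair of tables contains both values of an example tuple, so the only qualifying path goes through `departments`, which matches neither query column. A semantic variant would replace the equality tests with a similarity measure over value embeddings, which this sketch deliberately leaves out.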
Mohammad et al. studied this question.
synapsesocial.com/papers/69d893c96c1944d70ce04c20 — DOI: https://doi.org/10.1145/3786682
Mir Mahathir Mohammad
University of Utah
El Kindi Rezig
University of Utah
Proceedings of the ACM on Management of Data