Quantile estimation in data streams is a fundamental task in data analysis. Data structures named quantile sketches are constructed to summarize data and estimate quantiles using limited storage. Duplicate data records often occur in real-world data, while there is no relevant design or analysis in existing quantile sketches, including the best known method KLL sketch. In this paper, we present DupliSketch, a quantile sketch with optimizations for duplicates. The sketch compresses arrival duplicates into elements and performs randomized compaction operations to meet the space limit. To better summarize heavy-tailed data, DupliSketch maintains values frequent enough dynamically and counts them separately, while summarizes other values with weighted elements obtained from compactions. Analyses show its error in rank estimation, space complexity to provide error guarantees under input with duplicates, and time complexity. The approach is deployed as a user-defined function in a database Apache IoTDB. Extensive experiments support the theoretical analyses and compare DupliSketch with existing methods on real and synthetic datasets. On average, the error of the DupliSketch is 44% smaller than that of the baseline methods under the same space limit.
Building similarity graph...
Analyzing shared references across papers
Loading...
Tianrui Xia
Tsinghua University
Ziling Chen
Tsinghua University
Shaoxu Song
Tsinghua University
Proceedings of the ACM on Management of Data
Tsinghua University
Building similarity graph...
Analyzing shared references across papers
Loading...
Xia et al. (Thu,) studied this question.
synapsesocial.com/papers/69d894526c1944d70ce05373 — DOI: https://doi.org/10.1145/3786683
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: