Quantile estimation in data streams is a fundamental task in data analysis. Data structures named quantile sketches are constructed to summarize data and estimate quantiles using limited storage. Duplicate data records often occur in real-world data, while there is no relevant design or analysis in existing quantile sketches, including the best known method KLL sketch. In this paper, we present DupliSketch, a quantile sketch with optimizations for duplicates. The sketch compresses arrival duplicates into elements and performs randomized compaction operations to meet the space limit. To better summarize heavy-tailed data, DupliSketch maintains values frequent enough dynamically and counts them separately, while summarizes other values with weighted elements obtained from compactions. Analyses show its error in rank estimation, space complexity to provide error guarantees under input with duplicates, and time complexity. The approach is deployed as a user-defined function in a database Apache IoTDB. Extensive experiments support the theoretical analyses and compare DupliSketch with existing methods on real and synthetic datasets. On average, the error of the DupliSketch is 44% smaller than that of the baseline methods under the same space limit.
Xia et al. (Thu,) studied this question.