What question did this study set out to answer?

How can quantiles be estimated accurately in data streams that contain duplicate records? The study proposes a new sketching method, DupliSketch, designed for such streams.

April 10, 2026 | Open Access

Quantile Estimation with Duplicates

Key Points

Development of DupliSketch to handle duplicates in quantile estimation
Compression of duplicates into single elements
Dynamic maintenance of frequent values with separate counting
Utilization of weighted elements from compactions
Deployment as a user-defined function in Apache IoTDB
On average, DupliSketch shows 44% smaller error in rank estimation than baseline methods under the same space limit
Provides error guarantees under input with duplicates
Analyzes the space complexity needed to summarize heavy-tailed data
Experiments support the theoretical analysis and compare DupliSketch with existing methods on real and synthetic datasets

Abstract

Quantile estimation in data streams is a fundamental task in data analysis. Data structures called quantile sketches are constructed to summarize data and estimate quantiles using limited storage. Duplicate records often occur in real-world data, yet existing quantile sketches, including the best-known method, the KLL sketch, offer no relevant design or analysis for them. In this paper, we present DupliSketch, a quantile sketch with optimizations for duplicates. The sketch compresses arriving duplicates into single elements and performs randomized compaction operations to meet the space limit. To better summarize heavy-tailed data, DupliSketch dynamically maintains values that are sufficiently frequent and counts them separately, while summarizing the other values with weighted elements obtained from compactions. Analyses give its error in rank estimation, the space complexity needed to provide error guarantees under input with duplicates, and its time complexity. The approach is deployed as a user-defined function in the Apache IoTDB database. Extensive experiments support the theoretical analyses and compare DupliSketch with existing methods on real and synthetic datasets. On average, the error of DupliSketch is 44% smaller than that of the baseline methods under the same space limit.
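The ideas in the abstract (collapsing duplicates into one weighted element, counting frequent values separately, and randomized compaction of the rest) can be illustrated with a toy Python sketch. This is not the authors' DupliSketch algorithm: the class name `ToyDuplicateSketch`, the `capacity` parameter, the frequency threshold of `2 * n / capacity`, and the pairwise merge rule are all assumptions chosen for illustration, and the toy carries none of the paper's error guarantees.

```python
import random


class ToyDuplicateSketch:
    """Illustrative toy only; NOT the DupliSketch algorithm from the paper."""

    def __init__(self, capacity=64, seed=0):
        self.capacity = capacity   # hypothetical limit on stored (value, weight) pairs
        self.weights = {}          # value -> weight; duplicates collapse into one entry
        self.n = 0                 # number of stream items seen so far
        self.rng = random.Random(seed)

    def update(self, value):
        self.n += 1
        self.weights[value] = self.weights.get(value, 0) + 1
        if len(self.weights) > self.capacity:
            self._compact()

    def _compact(self):
        # Assumed rule: a value holding at least a 2/capacity share of the
        # stream counts as "frequent" and keeps its exact count.
        threshold = 2 * self.n / self.capacity
        light = sorted(v for v, w in self.weights.items() if w < threshold)
        # Merge adjacent light values pairwise. The survivor is picked with
        # probability proportional to its weight, so the expected rank
        # estimate for any query is unchanged by the merge.
        for a, b in zip(light[0::2], light[1::2]):
            wa, wb = self.weights[a], self.weights[b]
            keep, drop = (a, b) if self.rng.random() * (wa + wb) < wa else (b, a)
            self.weights[keep] = wa + wb
            del self.weights[drop]

    def rank(self, x):
        """Estimated number of stream items with value <= x."""
        return sum(w for v, w in self.weights.items() if v <= x)


# Usage: the value 0 is a heavy duplicate; 1..1000 each appear once.
sketch = ToyDuplicateSketch(capacity=64, seed=0)
for i in range(1, 1001):
    sketch.update(0)
    sketch.update(i)

print(sketch.rank(0))     # count of the frequent value 0, kept exactly
print(sketch.rank(500))   # approximate; the true rank is 1500
```

In this toy, the frequent value is never merged, so its count stays exact, while the one-off values are gradually folded into fewer, heavier elements. The total weight always equals the number of items seen, which is what lets `rank` be read as an estimate of the true rank.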
