September 9, 2005 · Open Access

Sampling to estimate arbitrary subset sums

Abstract

Starting with a set of weighted items, we want to create a generic sample of a certain size that we can later use to estimate the total weight of arbitrary subsets. For this purpose, we propose priority sampling, which, when tested on Internet data, performed better than previous methods by orders of magnitude. Priority sampling is simple to define and implement: we consider a stream of items i=0, ..., n-1 with weights wᵢ. For each item i, we generate a random number rᵢ in (0, 1) and create a priority qᵢ=wᵢ/rᵢ. The sample S consists of the k highest priority items. Let t be the (k+1)th highest priority. Each sampled item i in S gets a weight estimate Wᵢ=max{wᵢ, t}, while non-sampled items get weight estimate Wᵢ=0. Magically, it turns out that the weight estimates are unbiased, that is, E[Wᵢ]=wᵢ, and by linearity of expectation, we get unbiased estimators over any subset sum simply by adding the sampled weight estimates from the subset. Also, we can estimate the variance of the estimates, and surprisingly, there is no covariance between different weight estimates Wᵢ and Wⱼ. We conjecture an extremely strong near-optimality; namely, that for any weight sequence, there exists no specialized scheme for sampling k items with unbiased estimators that gets smaller total variance than priority sampling with k+1 items. Very recently Mario Szegedy has settled this conjecture.
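The procedure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `priority_sample` and `estimate_subset_sum` are hypothetical, the whole input is held in memory instead of being processed as a stream, and two conventions are assumed that the abstract does not state: rᵢ is drawn from (0, 1] via `1 - random()` to avoid division by zero, and when there are at most k items every item is kept with its exact weight.

```python
import heapq
import random


def priority_sample(weights, k, rng=None):
    """Return {item index: weight estimate W_i} for a priority sample of size k.

    Hypothetical helper: follows the abstract's definition, with the
    assumptions noted in the comments below.
    """
    rng = rng or random.Random()
    priorities = []
    for i, w in enumerate(weights):
        # random() lies in [0, 1), so r lies in (0, 1] (assumption: the
        # abstract uses the open interval (0, 1)).
        r = 1.0 - rng.random()
        priorities.append((w / r, i))  # priority q_i = w_i / r_i

    if len(priorities) <= k:
        # Assumed convention: with no (k+1)th priority, keep exact weights.
        return {i: weights[i] for _, i in priorities}

    top = heapq.nlargest(k + 1, priorities)
    t = top[k][0]  # threshold: the (k+1)th highest priority
    # Sampled items get W_i = max{w_i, t}; all other items implicitly get 0.
    return {i: max(weights[i], t) for _, i in top[:k]}


def estimate_subset_sum(sample, subset):
    """Estimate the total weight of `subset` by adding sampled estimates."""
    return sum(est for i, est in sample.items() if i in subset)
```

For example, `estimate_subset_sum(priority_sample(weights, k), {0, 2})` estimates the combined weight of items 0 and 2. A single estimate can be far from the true sum; unbiasedness means only that the average over many independent samples approaches it.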

Authors

Nick Duffield

Mitchell Institute

Carsten Lund

University of Hagen

Mikkel Thorup

University of Copenhagen

Cite this study

Duffield et al. (2005) studied this question.

synapsesocial.com/papers/6a212e2ba2a97f3a085ac7e6 — DOI: https://doi.org/10.48550/arxiv.cs/0509026

Also consider

Synapse has enriched 5 closely related papers on similar questions. Consider them for comparative context:

Flow sampling under hard resource constraints · 2004 · 110 citations
Sampling with Unequal Probabilities · 2006 · 181 citations
An Efficient Method for Weighted Sampling without Replacement · 1980 · 98 citations
Equivalence between priority queues and sorting · 2003 · 30 citations
Random sampling with a reservoir · 1985 · 1,756 citations
