March 3, 2026 | Open Access

Measuring Risk of Re-identification for a Nonprobability Sample Using a General Reference Sample

Key Points

Re-identification risk is assessed for non-probability samples, where neither membership in the subpopulation register nor the sampling mechanism is known.
Incorporating a probability-based reference sample makes it possible to infer the population parameters that the probabilistic risk model requires.
A simulation study and a real application illustrate the proposed method in practice.
The framework clarifies the re-identification risks of using non-probability samples and may inform data protection policies.

Abstract

Probabilistic estimation of the risk of re-identification is well developed for random representative samples drawn from the general population, such as the large-scale government surveys conducted regularly at National Statistical Institutes. Recent work extended this procedure to assess the risk of re-identification in non-probability subpopulation registers, such as a cancer register. In this paper, we extend that work to samples drawn from registers or, more generally, to non-probability samples such as those used in opt-in panels at survey organizations. We assume that membership in the subpopulation register is not known and that the sampling mechanism is also unknown. We show how to assess the risk of re-identification for these types of non-probability samples by using a probability-based reference sample to infer population parameters under the probabilistic modelling framework. We demonstrate the approach with a simulation study and a real application to the 2021 Survey of Doctoral Recipients, which is drawn from a subpopulation register of all PhD recipients from an accredited US institution.
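To make the "well-developed" baseline case concrete, the sketch below computes a standard per-record risk for sample uniques under a Poisson model. It is not the estimator proposed in this paper, which the abstract does not spell out. The sketch assumes that the population count in key cell k is Poisson with mean lambda_k and that units are sampled independently with a known fraction pi, in which case a sample-unique record has expected match risk (1 - exp(-mu)) / mu with mu = lambda_k (1 - pi). The function names, the toy key values, the `lambda_by_key` estimates, and the value of `pi` are all hypothetical. In the setting of the paper, the sampling mechanism is unknown and the population parameters would instead be inferred from the probability-based reference sample.

```python
import math
from collections import Counter


def expected_match_risk(mu: float) -> float:
    """E[1/F | f = 1] when the unsampled part of the cell, F - 1, is Poisson(mu)."""
    if mu < 1e-12:
        # Limit as mu -> 0: the record is unique in the population as well.
        return 1.0
    # -expm1(-mu) equals 1 - exp(-mu), computed stably for small mu.
    return -math.expm1(-mu) / mu


def sample_unique_risks(sample_keys, lambda_by_key, pi):
    """Per-record risk for every key combination that occurs exactly once in the sample.

    sample_keys   : list of hashable key combinations, one per sampled record
    lambda_by_key : hypothetical estimated population cell means (assumed given)
    pi            : hypothetical sampling fraction (assumed known in this sketch)
    """
    counts = Counter(sample_keys)
    risks = {}
    for key, f in counts.items():
        if f == 1:
            mu = lambda_by_key[key] * (1.0 - pi)
            risks[key] = expected_match_risk(mu)
    return risks


# Toy sample: key = (sex, age band, field); all values are illustrative only.
sample_keys = [
    ("F", "30-39", "Physics"),
    ("M", "30-39", "Physics"),
    ("M", "30-39", "Physics"),
    ("F", "60+", "History"),
]
lambda_by_key = {
    ("F", "30-39", "Physics"): 12.0,
    ("M", "30-39", "Physics"): 40.0,
    ("F", "60+", "History"): 1.5,
}
pi = 0.1

risks = sample_unique_risks(sample_keys, lambda_by_key, pi)
# Global risk: expected number of correct matches among the sample uniques.
global_risk = sum(risks.values())

for key, risk in risks.items():
    print(key, round(risk, 3))
print("expected correct matches:", round(global_risk, 3))
```

Cells with a small estimated population mean dominate the global measure, which is why the quality of the population-parameter estimates matters so much in this framework.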
