As the availability of public data continues to grow, protecting individual privacy remains a complex and unresolved challenge. Traditional anonymization methods often rely on removing direct identifiers such as names or ID numbers. This approach is frequently insufficient, especially when datasets retain quasi-identifiers like ZIP code, gender, and birthdate. When combined, these attributes can enable the re-identification of individuals, even in datasets considered anonymous.

This bachelor’s thesis, "Re-Identification Risks in Anonymized Data: A Replication of Sweeney’s Framework in the Context of Privacy Protection Strategies", investigates the extent of re-identification risk in Switzerland and discusses the role of established anonymization techniques and privacy models. Latanya Sweeney’s influential 2000 study demonstrated that 87% of the U.S. population could be uniquely identified using only three demographic attributes: ZIP code, gender, and birthdate. Her findings have had a lasting impact on data privacy research and regulation. Building on this foundation, this thesis replicates her methodology and contributes a national case study of Switzerland, a country with a highly fragmented population structure, to the international literature on re-identification risks. The case study illustrates how structural factors at the national level influence privacy risks in anonymized data.

Ten experiments were conducted using 2023 population data from the Swiss Federal Statistical Office and geographic data from the Federal Office of Topography. The experiments varied the combination of quasi-identifiers and the level of generalization across ZIP codes, municipalities, and cantons.

The findings show that anonymized data remains highly vulnerable to re-identification. When detailed quasi-identifiers such as full birthdate and 4-digit ZIP code were used, 97.3% of individuals could be uniquely identified.
However, generalizing birthdate to month and year and aggregating geographic data to the canton level reduced the re-identification risk to nearly zero. The analysis shows that re-identification risk is influenced more by subgroup size, demographic composition, and the precision of quasi-identifiers than by overall population size. These results underscore the limitations of relying only on structural anonymization strategies.

To build on these findings, the thesis outlines established privacy models and anonymization techniques. Generalization, in particular, was used throughout the experiments to reduce identifiability by grouping values such as ZIP codes and birthdates. While other privacy protection strategies were not applied in the experiments, the thesis explores how they could address different types of risks and highlights their individual limitations. Syntactic models like k-anonymity offer protection under specific conditions, but only differential privacy provides formal guarantees that remain effective against adversarial background knowledge. These models, especially when implemented together, could offer stronger protection in more complex or high-risk data environments.

The study concludes that effective anonymization requires a context-sensitive, multilayered approach. The Swiss case highlights how structural factors shape re-identification risk, and how the choice of anonymization strategy directly affects data protection outcomes. Anonymization remains an ongoing challenge that requires continued academic attention and practical policy solutions.
Gianna Tinner studied this question.
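
The uniqueness measure behind such experiments can be illustrated with a minimal sketch. It is not the thesis's actual code: the records are invented toy values, and the names `detailed`, `generalized`, `fraction_unique`, and `smallest_class` are hypothetical. The sketch groups records by their quasi-identifier combination, reports the share of records that sit alone in their group, and returns the smallest group size, which is the k of k-anonymity.

```python
from collections import Counter
from datetime import date

# Hypothetical toy records: (ZIP code, canton, gender, birthdate).
# These values are invented for illustration and are not thesis data.
RECORDS = [
    ("8001", "ZH", "F", date(1990, 3, 14)),
    ("8001", "ZH", "F", date(1990, 3, 27)),
    ("8001", "ZH", "M", date(1985, 7, 2)),
    ("8002", "ZH", "M", date(1985, 7, 19)),
    ("3011", "BE", "F", date(1990, 3, 14)),
    ("3012", "BE", "F", date(1990, 3, 30)),
]


def detailed(record):
    """Quasi-identifier at full precision: ZIP code, gender, full birthdate."""
    zip_code, _canton, gender, birth = record
    return (zip_code, gender, birth)


def generalized(record):
    """Generalized quasi-identifier: canton, gender, birth year and month."""
    _zip_code, canton, gender, birth = record
    return (canton, gender, birth.year, birth.month)


def class_sizes(records, key):
    """Count how many records share each quasi-identifier combination."""
    return Counter(key(r) for r in records)


def fraction_unique(records, key):
    """Share of records whose quasi-identifier combination occurs exactly once."""
    if not records:
        raise ValueError("records must not be empty")
    sizes = class_sizes(records, key)
    unique = sum(1 for n in sizes.values() if n == 1)
    return unique / len(records)


def smallest_class(records, key):
    """Size of the smallest group; the data is k-anonymous for this k."""
    if not records:
        raise ValueError("records must not be empty")
    return min(class_sizes(records, key).values())


if __name__ == "__main__":
    for name, key in (("detailed", detailed), ("generalized", generalized)):
        print(name, fraction_unique(RECORDS, key), smallest_class(RECORDS, key))
```

In this toy data every record is unique at full precision, while generalizing to canton and birth month leaves no unique record. The direction of the effect matches the abstract, but the numbers themselves carry no empirical meaning.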