July 17, 2023 (Open Access)

A comparison of synthetic data generation and federated analysis for enabling international evaluations of cardiovascular health

Structured PICO

Does synthetic data generation produce consistent results compared to federated analysis for evaluating cardiovascular health across international jurisdictions?

Population

79,293 individuals from the Canadian Community Health Survey (CCHS) 2014 (n=63,522) and the Austria Health Interview Survey (ATHIS) 2014 (n=15,771)

Intervention

Synthetic data generation (SDG) of the Canadian dataset pooled with the real Austrian dataset

Comparator

Federated analysis on the original source datasets (DataSHIELD)

Outcome

Consistency of regression results (parameter estimates) between the two approaches for assessing country-level differences in the role of sex on cardiovascular health (CVH) using a modified CANHEART index
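One way to read "consistency of parameter estimates" is as a term-by-term check that the two analyses agree in sign and have overlapping confidence intervals. The sketch below illustrates that reading only. The source does not say how consistency was scored beyond the estimates being consistent and directionally the same, so the overlap criterion, the function names, the term labels, and all numeric values are hypothetical.

```python
from typing import Dict, Tuple

# (point estimate, (lower CI bound, upper CI bound)) for one regression term
Estimate = Tuple[float, Tuple[float, float]]


def same_direction(a: float, b: float) -> bool:
    """True when both estimates have the same sign (two zeros also agree)."""
    return (a > 0) == (b > 0) and (a < 0) == (b < 0)


def intervals_overlap(ci_a: Tuple[float, float], ci_b: Tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


def compare_estimates(
    federated: Dict[str, Estimate], pooled: Dict[str, Estimate]
) -> Dict[str, Dict[str, bool]]:
    """Compare each term of the federated fit against the pooled fit."""
    report = {}
    for term, (est_f, ci_f) in federated.items():
        est_p, ci_p = pooled[term]
        report[term] = {
            "same_direction": same_direction(est_f, est_p),
            "ci_overlap": intervals_overlap(ci_f, ci_p),
        }
    return report


# Hypothetical numbers for illustration; these are NOT results from the study.
federated = {
    "sex": (0.30, (0.20, 0.40)),
    "country": (-0.15, (-0.25, -0.05)),
    "sex:country": (0.08, (0.01, 0.15)),
}
pooled = {
    "sex": (0.27, (0.18, 0.36)),
    "country": (-0.12, (-0.21, -0.03)),
    "sex:country": (0.06, (-0.01, 0.13)),
}

report = compare_estimates(federated, pooled)
for term, checks in report.items():
    print(term, checks)
```

A stricter comparison could also look at the relative size of the estimates, but sign agreement and interval overlap are the minimum needed to call two fits directionally the same.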

Synthetic data generation provides a highly efficient and privacy-preserving alternative to federated analysis for conducting international comparative studies in cardiovascular health.

Abstract

Sharing health data for research purposes across international jurisdictions has been a challenge due to privacy concerns. Two privacy-enhancing technologies that can enable such sharing are synthetic data generation (SDG) and federated analysis, but their relative strengths and weaknesses have not been evaluated thus far. In this study, we compared SDG with federated analysis as ways to enable international comparative studies. The objective of the analysis was to assess country-level differences in the role of sex on cardiovascular health (CVH) using a pooled dataset of Canadian and Austrian individuals. The Canadian data were synthesized and sent to the Austrian team for analysis. The utility of the pooled (synthetic Canadian + real Austrian) dataset was evaluated by comparing its regression results with those from the federated analysis. The privacy of the Canadian synthetic data was assessed using a membership disclosure test, which showed an F1 score of 0.001, indicating low privacy risk. The outcome variable of interest was CVH, calculated through a modified CANHEART index. The main and interaction effect parameter estimates of the federated and pooled analyses were consistent and directionally the same. It took approximately one month to set up the synthetic data generation platform and generate the synthetic data, whereas it took over 1.5 years to set up the federated analysis system. Synthetic data generation can be an efficient and effective tool for enabling multi-jurisdictional studies while addressing privacy concerns.
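For readers unfamiliar with the F1 score reported for the membership disclosure test, the following minimal sketch shows the standard F1 definition applied to a membership attack: an adversary guesses which records were in the training data, and F1 summarizes how accurate those guesses are. The source does not describe the exact variant used (some membership disclosure metrics adjust F1 against a naive baseline), so the function name, the record IDs, and the plain unadjusted formula are assumptions.

```python
def membership_f1(true_members: set, predicted_members: set) -> float:
    """Standard F1 of an attacker's membership guesses.

    true_members:      IDs of records that really were in the training data
    predicted_members: IDs the attacker claims were in the training data
    """
    true_positives = len(true_members & predicted_members)
    if true_positives == 0:
        return 0.0  # no correct guesses, so precision and recall are both 0
    precision = true_positives / len(predicted_members)
    recall = true_positives / len(true_members)
    return 2 * precision * recall / (precision + recall)


# Hypothetical record IDs for illustration; NOT data from the study.
training_ids = {1, 2, 3, 4}
attacker_guesses = {3, 4, 5, 6}
print(membership_f1(training_ids, attacker_guesses))  # 0.5
```

An F1 near 0, like the value reported above, means the attacker's guesses about who was in the training data are almost never correct.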

Authors

Zahra Azizi

University of Ottawa

Simon David Lindner

Yumika Shiba

Journals

Scientific Reports

Institutions

McGill University

Karolinska Institutet

University of Alberta