BACKGROUND Data linkage is commonly employed in pharmacoepidemiological research to ascertain exposures and outcomes, or to obtain more information about confounding variables. However, unique patient identifiers are usually withheld to protect patient confidentiality, which makes linking data across sources challenging. The Saudi Real-Evidence Researches Network (RERN) aggregates electronic health records (EHRs) from various hospitals, so linking its records may require a robust linkage technique.
OBJECTIVE To evaluate and compare the performance of deterministic, probabilistic, and machine learning (ML) approaches for linking de-identified multiple sclerosis (MS) patient data from the RERN and Ministry of National Guard Health Affairs (MNGHA) EHR systems.
METHODS We applied a simulation-based validation framework before linking the real-world data sources. Deterministic linkage was based on predefined rules, while probabilistic linkage was based on similarity-score matching. For ML, we applied both similarity-score and classification approaches, with models including neural networks, logistic regression, and random forest. The performance of each approach was assessed using a confusion matrix, focusing on sensitivity, positive predictive value (PPV), F1-score, and computational efficiency.
RESULTS Linkage of records for 2,247 MS patients (spanning 2016 to 2023) showed that deterministic methods achieved an F1-score of 97.2%, with match rates ranging from 46.6% to 86.6%. Probabilistic linkage produced a mean F1-score of 93.9% and identified between 65.5% and 95.4% of matched pairs. In contrast, ML approaches reached accuracies of up to 99.37%, but at the cost of higher computational demands, with match rates between 35.1% and 89.6%.
CONCLUSIONS Probabilistic linkage offers high linkage capacity by recovering matches missed by deterministic methods, and it proved to be a flexible and efficient method, especially in real-world scenarios where unique identifiers are lacking. It achieved a good balance between recall (sensitivity) and precision (PPV), enabling better integration of various data sources that could be useful in MS research.
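To make the contrast between rule-based and similarity-score linkage concrete, the following is a minimal Python sketch on toy data. It is not the authors' implementation. The field names (`birth_year`, `sex`, `diagnosis_date`), the equal field weights, the 0.85 threshold, and the use of `difflib.SequenceMatcher` as the string similarity are all assumptions made for illustration. The `evaluate` helper computes sensitivity, PPV, and F1-score from the true-positive, false-positive, and false-negative pair counts.

```python
from difflib import SequenceMatcher

# Hypothetical quasi-identifiers; the abstract does not list the fields used.
FIELDS = ("birth_year", "sex", "diagnosis_date")


def deterministic_match(a, b):
    """Rule-based linkage: every field must agree exactly."""
    return all(a[f] == b[f] for f in FIELDS)


def similarity_score(a, b):
    """Mean string similarity across fields (equal weights, an assumption)."""
    ratios = [SequenceMatcher(None, a[f], b[f]).ratio() for f in FIELDS]
    return sum(ratios) / len(ratios)


def probabilistic_match(a, b, threshold=0.85):
    """Similarity-score linkage: accept pairs scoring at or above the threshold."""
    return similarity_score(a, b) >= threshold


def link(left, right, matcher):
    """Compare all record pairs and return the set of (i, j) index pairs linked."""
    return {
        (i, j)
        for i, a in enumerate(left)
        for j, b in enumerate(right)
        if matcher(a, b)
    }


def evaluate(predicted, truth):
    """Sensitivity, PPV, and F1-score from predicted and true link sets."""
    tp = len(predicted & truth)
    fp = len(predicted - truth)
    fn = len(truth - predicted)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    denom = ppv + sensitivity
    f1 = 2 * ppv * sensitivity / denom if denom else 0.0
    return {"sensitivity": sensitivity, "ppv": ppv, "f1": f1}


# Toy records: the second pair differs by a transposed digit in the date.
left = [
    {"birth_year": "1985", "sex": "F", "diagnosis_date": "2017-03-12"},
    {"birth_year": "1990", "sex": "M", "diagnosis_date": "2019-11-02"},
]
right = [
    {"birth_year": "1985", "sex": "F", "diagnosis_date": "2017-03-12"},
    {"birth_year": "1990", "sex": "M", "diagnosis_date": "2019-11-20"},
]
truth = {(0, 0), (1, 1)}

det_links = link(left, right, deterministic_match)
prob_links = link(left, right, probabilistic_match)
det_metrics = evaluate(det_links, truth)
prob_metrics = evaluate(prob_links, truth)
```

On this toy input the exact-match rule links only the first pair, while the similarity score also recovers the pair with the date typo. A real pipeline would normally add blocking to avoid comparing every pair and would calibrate field weights and the threshold on validation data.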
Almadani et al. (Tue,) studied this question.