Over the past decade, the spread of digital technologies has pushed us firmly into the era of Big Data, with information growing at an unprecedented pace and scale. In healthcare, for instance, organizations now have to handle everything from transactional logs and sensor readings to free-text notes and multimedia files. While these massive datasets hold huge potential for deeper insights, they also pose serious integration challenges. Data collected separately often contains typos, duplicate records, inconsistent formats, and missing details and, most importantly, lacks a shared identifier to link it all together. Poor data integration can have serious real-world consequences. A patient's medical history may be dispersed across several clinics and hospitals, potentially leading to redundant testing, incorrect diagnoses, and compromised patient safety.

Entity matching, also known as record linkage or entity resolution, is the process of identifying which records in one or more datasets refer to the same real-world entity. Early deterministic methods, which require exact agreement on one or more key fields, proved too fragile and broke down when records contained formatting or spelling variations. Probabilistic linkage models enable fuzzy comparisons instead: they use string-similarity measures to decide whether each attribute agrees or disagrees, and they assign agreement and disagreement weights to each attribute in order to classify record pairs. However, because a naïve all-pairs comparison grows quadratically with dataset size, these methods are costly to apply and can fail to scale beyond millions of records. Modern techniques aim to counteract this combinatorial explosion of candidate pairs. Record linkage can be scaled to large enterprises by incorporating in-memory joins and parallel processing; deployed on a distributed computing framework such as Apache Spark, typically in a cloud environment, this yields a convincing architectural solution.
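To make the ideas of attribute-level agreement weights and candidate-pair explosion concrete, the following minimal Python sketch may help. It is an illustration under stated assumptions, not the method developed in this dissertation: the m/u probabilities, the similarity threshold, the sample records, and the blocking key are all hypothetical, and blocking is used here as one common way of reducing the number of candidate pairs.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations
from math import log2

# Hypothetical per-attribute probabilities (illustrative values only).
# m = P(attribute agrees | same entity), u = P(attribute agrees | different entities).
PARAMS = {
    "surname": (0.95, 0.01),
    "given": (0.90, 0.05),
    "birth_year": (0.98, 0.02),
}
SIM_THRESHOLD = 0.8  # assumed cut-off for treating two strings as agreeing


def agrees(a: str, b: str) -> bool:
    """Fuzzy agreement: similarity ratio at or above the threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= SIM_THRESHOLD


def match_weight(r1: dict, r2: dict) -> float:
    """Sum the log2 agreement or disagreement weight of every attribute."""
    total = 0.0
    for field, (m, u) in PARAMS.items():
        if agrees(r1[field], r2[field]):
            total += log2(m / u)              # agreement weight (positive)
        else:
            total += log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total


def naive_pair_count(n: int) -> int:
    """Number of comparisons in an all-pairs scheme over n records."""
    return n * (n - 1) // 2


def blocked_pairs(records, key):
    """Yield only the pairs of records that share the same blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key(record)].append(record)
    for group in blocks.values():
        yield from combinations(group, 2)


# Hypothetical sample records.
records = [
    {"id": 1, "surname": "Papadopoulos", "given": "Maria", "birth_year": "1984"},
    {"id": 2, "surname": "Papadopulos", "given": "Maria", "birth_year": "1984"},
    {"id": 3, "surname": "Smith", "given": "John", "birth_year": "1990"},
]


def blocking_key(record: dict) -> str:
    """Assumed blocking key: surname initial plus birth year."""
    return record["surname"][0].upper() + record["birth_year"]


candidates = list(blocked_pairs(records, blocking_key))
```

In this toy data set, blocking leaves a single candidate pair (records 1 and 2), which receives a positive total weight despite the spelling variation in the surname, while an all-pairs scheme over one million records would require `naive_pair_count(1_000_000)` comparisons, roughly 5 × 10^11. A poorly chosen blocking key can separate true matches into different blocks, so the key used here should be read as an assumption rather than a recommendation.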
Connecting sensitive data across organizational boundaries introduces an additional layer of complexity. Privacy-preserving record linkage (PPRL) techniques address the need to match records, such as those containing financial or medical identifiers, without revealing personal information [1]. Popular techniques include Bloom filters, secure multi-party computation, and phonetic hashing. Even with these solutions, significant gaps remain. Many approaches concentrate on simple one-to-one dataset matching and do not support multi-party scenarios involving three or more data sources. Furthermore, once linkage outputs are shared, sensitive matches become vulnerable, because most frameworks do not adequately protect against inference attacks.

To address these issues, this dissertation introduces a multi-party entity matching pipeline, built on PySpark, that integrates privacy-enhancing noise injection, distributed similarity scoring, and phonetic encoding in an end-to-end process. The proposed system is designed to balance scalability, accuracy, and privacy by combining PPRL techniques with the scalability of Spark-based entity resolution (ER) systems. Our pipeline is expected to achieve high matching quality, scale linearly with data volume, and offer a modular configuration interface for quick experimentation.
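As a rough illustration of how Bloom-filter encoding, one of the popular PPRL techniques mentioned above, lets parties compare values without exchanging them in clear text, consider the single-machine Python sketch below. It is a generic example and not the pipeline proposed in this dissertation. The filter length, the number of hash functions, the shared secret, and the bit-flipping form of noise injection are all assumptions made for illustration.

```python
import hashlib
import random

FILTER_BITS = 256  # assumed Bloom filter length
NUM_HASHES = 4     # assumed number of hash functions per bigram


def bigrams(text: str) -> set[str]:
    """Split a padded, lower-cased string into its set of character bigrams."""
    padded = f"_{text.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(text: str, secret: str) -> set[int]:
    """Return the bit positions set to 1 when the value's bigrams are hashed."""
    positions = set()
    for gram in bigrams(text):
        for k in range(NUM_HASHES):
            # Keyed hashing: the secret is assumed to be shared by the data owners.
            digest = hashlib.sha256(f"{secret}|{k}|{gram}".encode()).digest()
            positions.add(int.from_bytes(digest[:8], "big") % FILTER_BITS)
    return positions


def add_noise(positions: set[int], flip_prob: float, rng: random.Random) -> set[int]:
    """Flip each bit of the filter independently with probability flip_prob."""
    noisy = set(positions)
    for bit in range(FILTER_BITS):
        if rng.random() < flip_prob:
            noisy ^= {bit}  # a set bit is cleared, a cleared bit is set
    return noisy


def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient between two filters given as sets of 1-bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


SECRET = "shared-secret"  # hypothetical key agreed on by the data owners
enc_a = bloom_encode("Papadopoulos", SECRET)
enc_b = bloom_encode("Papadopulos", SECRET)
enc_c = bloom_encode("Smith", SECRET)

similar = dice(enc_a, enc_b)    # spelling variants share most bigrams
different = dice(enc_a, enc_c)  # unrelated names share no bigrams

# Noise trades matching accuracy for resistance to re-identification.
noisy_a = add_noise(enc_a, flip_prob=0.05, rng=random.Random(0))
```

Because similar strings share most of their bigrams, their encodings overlap heavily and `similar` comes out well above `different`, which is what allows fuzzy matching on encoded values. Raising `flip_prob` makes it harder to infer the original value from the filter but also lowers the similarity between true matches, which is the privacy and accuracy trade-off that any noise-injection step has to balance.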
Κωνσταντίνος Α. Ράζγκελης (Wed,) studied this question.