Because data collection in biomedical domains is costly and complex, it is common practice to combine data from multiple sites to obtain the large datasets required for machine learning. However, undesired site-specific variability presents challenges. Data harmonization aims to address this issue by removing site-specific variance while preserving biologically relevant information. We show that the performance improvements obtained with widely used ComBat-based harmonization are driven by data leakage, caused by illicit use of target information, when class labels are imbalanced across sites, a common scenario in biomedical domains. We propose a novel approach, PrettYharmonize, which leverages subtle differences in data harmonized using different pretended target values. Using controlled benchmark datasets and real-world magnetic resonance imaging and clinical ICU data, we demonstrate that our leakage-free PrettYharmonize method achieves performance comparable to leakage-prone methods. As such, it is a viable way to integrate ComBat-based methods into machine learning applications.

• When classes are imbalanced across sites, ComBat-based harmonization requires test labels to preserve the relevant variance and avoid signal loss, which leads to data leakage.
• If no test labels are provided in such scenarios, ComBat-based harmonization removes the signal of interest.
• PrettYharmonize enables leakage-free integration of harmonization into ML pipelines by eliminating the need for test targets while preserving predictive signals.
Nieto et al. (Sun,) studied this question.
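The dilemma described in the highlights can be illustrated with a toy simulation. The sketch below is not ComBat and not PrettYharmonize: it substitutes a much simpler location-only adjustment (per-site mean removal, with or without a label covariate) for the full ComBat model, and all numbers (sample size, class proportions, effect sizes) are hypothetical choices made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # samples per site (hypothetical)

def make_site(p_pos, site_offset):
    """One feature: class effect of +1.0 plus an additive site offset."""
    y = (rng.random(n) < p_pos).astype(float)
    x = 1.0 * y + site_offset + rng.normal(0.0, 0.5, n)
    return x, y

# Class labels are imbalanced across sites: site A is mostly class 0,
# site B is mostly class 1.
x_a, y_a = make_site(p_pos=0.1, site_offset=0.0)
x_b, y_b = make_site(p_pos=0.9, site_offset=2.0)
x = np.concatenate([x_a, x_b])
y = np.concatenate([y_a, y_b])
site = np.concatenate([np.zeros(n), np.ones(n)])

def class_gap(values):
    """Pooled difference between class means (simulated true effect: 1.0)."""
    return values[y == 1].mean() - values[y == 0].mean()

# (a) Label-free adjustment: subtract each site's mean. Because site and
# class are confounded, this also removes part of the class signal.
x_blind = x.copy()
for s in (0.0, 1.0):
    x_blind[site == s] -= x[site == s].mean()

# (b) Label-aware adjustment: fit x ~ 1 + site + y by least squares and
# subtract only the site term. This needs y for every sample, which in a
# prediction setting would include the test labels (data leakage).
design = np.column_stack([np.ones_like(x), site, y])
coef, *_ = np.linalg.lstsq(design, x, rcond=None)
x_aware = x - coef[1] * site

gap_raw = class_gap(x)
gap_blind = class_gap(x_blind)
gap_aware = class_gap(x_aware)
print(f"raw: {gap_raw:.2f}  label-free: {gap_blind:.2f}  label-aware: {gap_aware:.2f}")
```

Under these assumptions, the raw pooled class gap is inflated by the site offset, the label-free adjustment shrinks it well below the simulated effect of 1.0, and the label-aware adjustment recovers a value close to 1.0 at the price of using the labels.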