What does this research mean for the field?

PrettYharmonize is a leakage-free data harmonization method for machine learning that preserves predictive signal without requiring test labels. It addresses the data leakage that arises when class labels are imbalanced across sites. Novelty: novel finding. Consensus alignment: challenges consensus.

What question did this study set out to answer?

The study investigates how data leakage affects harmonization methods for machine learning when class balance differs across sites.

February 28, 2026 · Open Access

Impact of leakage on data harmonization in machine learning pipelines in class imbalance across sites.

Key points

Investigated the impact of data leakage on harmonization methods for machine learning under site-specific class imbalance.
Combined data from multiple biomedical sites for analysis.
Evaluated the performance of ComBat-based harmonization methods.
Developed PrettYharmonize, a leakage-free approach for data harmonization.
Used controlled benchmark datasets and real-world magnetic resonance imaging and clinical ICU data.
PrettYharmonize achieved performance similar to leakage-prone methods.
ComBat-based methods needed test labels to avoid signal loss, and using those labels leaks target information.
Class imbalance across sites was identified as the condition under which this leakage arises.

Abstract

Due to the cost and complexity of data collection in biomedical domains, it is a common practice to combine data from multiple sites to obtain large datasets required for machine learning. However, undesired site-specific variability presents challenges. Data harmonization aims to address this issue by removing site-specific variance while preserving biologically relevant information. We show that the widely used ComBat-based harmonization improvements are driven by data leakage due to illicit use of target information when class labels are imbalanced across sites, a common scenario in biomedical domains. We propose a novel approach, PrettYharmonize, which leverages subtle differences in data harmonized using different pretended target values. Using controlled benchmark datasets and real-world magnetic resonance imaging and clinical ICU data, we demonstrate that our leakage-free PrettYharmonize method achieves performance comparable to leakage-prone methods. As such, it is a viable method to integrate ComBat-based methods in machine learning applications.

• In class imbalance across sites, ComBat-based harmonization requires test labels to preserve relevant variance to avoid signal loss, leading to data leakage.
• If no test labels are provided in such scenarios, ComBat-based harmonization removes the signal of interest.
• PrettYharmonize enables integration of harmonization in ML pipelines in a leakage-free way, by eliminating the need for test targets while preserving predictive signals.
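The trade-off described in the abstract can be illustrated with a toy simulation. The sketch below is not ComBat and not PrettYharmonize: it uses per-site mean centering as a simplified, location-only stand-in for harmonization, and every function name, effect size, and class proportion in it is an assumption chosen for illustration. It simulates one feature measured at two sites with opposite class balance, then compares the class difference that remains after harmonizing without labels against harmonizing with labels.

```python
import numpy as np

def simulate(n_per_site=1000, class_effect=2.0, site_effect=3.0, seed=0):
    """Two sites with opposite class balance (assumed 90% vs 10% positives)."""
    rng = np.random.default_rng(seed)
    site = np.repeat([0, 1], n_per_site)
    p_positive = np.where(site == 0, 0.9, 0.1)
    y = (rng.random(site.size) < p_positive).astype(int)
    # Feature = class signal + additive site offset + unit Gaussian noise.
    x = class_effect * y + site_effect * site + rng.normal(0.0, 1.0, site.size)
    return x, y, site

def harmonize_without_labels(x, site):
    """Subtract each site's mean; the class effect is not modelled."""
    out = x.copy()
    for s in np.unique(site):
        mask = site == s
        out[mask] -= x[mask].mean()
    return out

def harmonize_with_labels(x, y, site):
    """Fit x ~ site offsets + class effect, then remove only the site offsets."""
    site_dummies = np.column_stack([site == s for s in np.unique(site)]).astype(float)
    design = np.column_stack([site_dummies, y.astype(float)])
    coef, *_ = np.linalg.lstsq(design, x, rcond=None)
    return x - site_dummies @ coef[:-1]

def class_gap(x, y):
    """Difference between the class means of the pooled data."""
    return x[y == 1].mean() - x[y == 0].mean()

x, y, site = simulate()
gap_blind = class_gap(harmonize_without_labels(x, site), y)
gap_labelled = class_gap(harmonize_with_labels(x, y, site), y)
print(f"class gap, label-blind harmonization: {gap_blind:.2f}")
print(f"class gap, label-aware harmonization: {gap_labelled:.2f}")
```

In this toy setting the label-blind variant leaves a class gap well below the simulated effect of 2.0, because each site's mean is dominated by its majority class and part of the class signal is subtracted along with the site offset. The label-aware variant recovers the effect, but it needs `y` for every sample it harmonizes, so applying it to test data would use test labels.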
