Ensuring fairness in clinical machine learning is a major concern, yet the dominant driver of unequal performance across sex groups remains unclear: is it the dataset or the algorithm? We conducted a systematic fairness evaluation across three healthcare domains: wearable physiology (MHEALTH), cardiac risk prediction (UCI Heart Disease), and stroke assessment. The evaluation used ten widely used classifiers and three controlled sex-ratio sampling scenarios (50/50, 90/10, 10/90) under an identical analytical pipeline. Gender accuracy gaps varied markedly across datasets and exhibited dataset-specific patterns that did not generalize across clinical domains. Mixed-effects interaction modelling showed that the same algorithm could display negligible bias in one dataset and substantial bias in another. Variance contribution decomposition of the absolute Gender Accuracy Gap (|GAG|) indicated that dataset identity accounted for most of the observed variability (63.4%), with an additional contribution from dataset–algorithm interactions (17.2%); algorithm choice alone explained 9.7%, whereas sampling scenario contributed negligibly (0.2%). Balanced sampling reduced disparities but did not eliminate them, consistent with residual sex-associated structure in the signals and features beyond representation imbalance. These findings indicate that fairness in healthcare machine learning is primarily dataset-dependent, motivating dataset- and context-specific auditing before clinical deployment.
Elgendi et al. studied this question.
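To make the two quantities named in the abstract concrete, the following is a minimal sketch and not the authors' pipeline. It assumes the Gender Accuracy Gap is the accuracy on one sex group minus the accuracy on the other (the abstract does not give the exact definition), and it substitutes a simple one-way sum-of-squares share for the mixed-effects variance decomposition the study reports. The function names, the group labels "M" and "F", and the `abs_gag` key are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def _mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def gender_accuracy_gap(y_true, y_pred, sex, group_a="M", group_b="F"):
    """Accuracy on group_a minus accuracy on group_b.

    Assumed definition of the gap; the labels "M" and "F" are illustrative.
    """
    hits = defaultdict(list)
    for truth, pred, group in zip(y_true, y_pred, sex):
        hits[group].append(1.0 if truth == pred else 0.0)
    return _mean(hits[group_a]) - _mean(hits[group_b])


def factor_share(rows, factor, value="abs_gag"):
    """Fraction of the total sum of squares explained by one factor's group means.

    A simplified stand-in for a variance contribution decomposition. The
    per-factor shares add up cleanly only in a balanced design, and
    interaction terms are not estimated here.
    """
    values = [row[value] for row in rows]
    grand_mean = _mean(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    if total_ss == 0:
        return 0.0
    groups = defaultdict(list)
    for row in rows:
        groups[row[factor]].append(row[value])
    between_ss = sum(
        len(g) * (_mean(g) - grand_mean) ** 2 for g in groups.values()
    )
    return between_ss / total_ss
```

In use, one would compute `abs(gender_accuracy_gap(...))` for every dataset, algorithm, and sampling scenario, collect the results as rows such as `{"dataset": ..., "algorithm": ..., "abs_gag": ...}`, and compare `factor_share(rows, "dataset")` with `factor_share(rows, "algorithm")`. A larger share for the dataset factor would be in line with the pattern the abstract describes, although this sketch cannot reproduce the reported percentages.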