Algorithmic prediction models are increasingly used to support decision-making in high-stakes environments, including emergency departments (ED). However, aggregate performance metrics may obscure systematic differences in classification errors or calibration across subgroups. This study presents a stage-wise, multi-metric, and interpretable fairness auditing framework for ED prediction. The framework compares mitigation strategies across data-, model-, and decision-level interventions, evaluates subgroup fairness using complementary classification and calibration criteria including equalized odds difference (EOD) and expected calibration error (ECE) disparity, and incorporates interpretability analyses based on SHapley Additive exPlanations (SHAP) and the calibration adjustment difference (CAD) to characterize changes in feature-contribution patterns and subgroup-specific probability adjustments after mitigation. The framework was applied to 126,819 ED encounters from MIMIC-IV-ED using measurements recorded within the first 2 h after arrival, and penalized logistic regression and random forest models were compared under reweighting, reduction, and multicalibration. Baseline AUROC values were 0.748 ± 0.028 for random forest and 0.746 ± 0.028 for penalized logistic regression. Reduction and multicalibration largely preserved discrimination performance, whereas reweighting was associated with reduced AUROC and AUPRC. Reweighting most clearly reduced EOD-based classification disparity, particularly for age, yielding reductions of 80.6% in random forest and 86.4% in penalized logistic regression. By contrast, multicalibration most consistently reduced ECE-based calibration disparity for sex and age but did not consistently improve EOD-based classification disparity. 
In the interpretability analyses, SHAP indicated that data- and model-level mitigation altered feature-contribution patterns, whereas CAD showed that decision-level mitigation produced subgroup-specific probability adjustments that varied in direction and magnitude across groups. These findings reveal trade-offs among discrimination performance, classification fairness, and calibration fairness, indicating that fairness mitigation should be guided by a clearly defined target fairness objective. Pre-deployment fairness auditing should therefore combine complementary fairness metrics with interpretability analyses to evaluate both subgroup-level outcomes and unintended changes in model behavior.
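The two disparity measures named above can be illustrated with a minimal sketch. It assumes one common definition of EOD (the larger of the between-group gaps in true-positive and false-positive rate) and takes ECE disparity to be the gap between the largest and smallest subgroup ECE, with ECE computed over 10 equal-width probability bins. The study's exact definitions and binning may differ, and the function names are hypothetical, not taken from the authors' code.

```python
import numpy as np

def equalized_odds_difference(y_true, y_pred, groups):
    """Larger of the between-group TPR gap and FPR gap (assumed definition).

    Assumes every group contains at least one positive and one negative case.
    """
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    tprs, fprs = [], []
    for g in np.unique(groups):
        in_group = groups == g
        tprs.append(y_pred[in_group & (y_true == 1)].mean())  # true-positive rate
        fprs.append(y_pred[in_group & (y_true == 0)].mean())  # false-positive rate
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean |observed rate - mean predicted probability| over bins."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin indices run from 0 to n_bins - 1.
    bin_idx = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_idx == b
        if in_bin.any():
            gap = abs(y_true[in_bin].mean() - y_prob[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by share of samples in the bin
    return ece

def ece_disparity(y_true, y_prob, groups, n_bins=10):
    """Gap between the worst- and best-calibrated subgroup (assumed definition)."""
    y_true, y_prob, groups = map(np.asarray, (y_true, y_prob, groups))
    eces = [
        expected_calibration_error(y_true[groups == g], y_prob[groups == g], n_bins)
        for g in np.unique(groups)
    ]
    return max(eces) - min(eces)
```

Because the first function depends on thresholded predictions and the second on predicted probabilities, an intervention can move one without moving the other, which is consistent with the trade-offs reported above.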
Shin et al. studied this question.