What question did this study set out to answer?

This paper aims to develop a framework for auditing the fairness and interpretability of prediction models used in emergency departments.

June 13, 2026 · Open Access

Multicriteria Adjustment Fairness Framework: Measurement, Mitigation, and Interpretability in Emergency Department Prediction

Key Points

The framework was applied to 126,819 ED encounters from MIMIC-IV-ED, using measurements recorded within the first 2 h after arrival.
Penalized logistic regression and random forest models were compared under three mitigation strategies: reweighting, reduction, and multicalibration.
Subgroup fairness was evaluated with complementary classification and calibration metrics, including equalized odds difference (EOD) and expected calibration error (ECE) disparity.
Baseline AUROC was 0.748 ± 0.028 for random forest and 0.746 ± 0.028 for penalized logistic regression.
Reweighting reduced EOD-based classification disparity for age by 80.6% for random forest and 86.4% for penalized logistic regression, but was associated with reduced AUROC and AUPRC.
Multicalibration most consistently reduced ECE-based calibration disparity for sex and age but did not consistently improve EOD-based classification disparity.
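
To make the EOD figures above concrete, the following is a minimal sketch of one common way to compute an equalized odds difference: the larger of the between-group gaps in true-positive rate and false-positive rate. This definition is an assumption, since the summary does not state the paper's exact formula. The function names and the toy data are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Return {group: (TPR, FPR)} from binary labels and binary predictions."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for y, p, g in zip(y_true, y_pred, groups):
        if y:
            counts[g]["tp" if p else "fn"] += 1
        else:
            counts[g]["fp" if p else "tn"] += 1
    rates = {}
    for g, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        if positives == 0 or negatives == 0:
            raise ValueError(f"group {g!r} needs both positive and negative cases")
        rates[g] = (c["tp"] / positives, c["fp"] / negatives)
    return rates


def equalized_odds_difference(y_true, y_pred, groups):
    """Assumed definition: max of the TPR gap and the FPR gap across groups."""
    rates = group_rates(y_true, y_pred, groups)
    tprs = [tpr for tpr, _ in rates.values()]
    fprs = [fpr for _, fpr in rates.values()]
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))


def relative_reduction(before, after):
    """Percentage reduction in a disparity value after mitigation."""
    return 100.0 * (before - after) / before


# Toy data (illustrative only, not from MIMIC-IV-ED).
toy_y_true = [1, 1, 0, 0, 1, 1, 0, 0]
toy_y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
toy_groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

# Group a: TPR 1.0, FPR 0.5; group b: TPR 0.5, FPR 0.0, so both gaps are 0.5.
print(equalized_odds_difference(toy_y_true, toy_y_pred, toy_groups))
```

A percentage such as the 80.6% reported for random forest would then correspond to relative_reduction(baseline_eod, mitigated_eod), assuming the reported reductions are relative to the baseline disparity.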

Abstract

Algorithmic prediction models are increasingly used to support decision-making in high-stakes environments, including emergency departments (ED). However, aggregate performance metrics may obscure systematic differences in classification errors or calibration across subgroups. This study presents a stage-wise, multi-metric, and interpretable fairness auditing framework for ED prediction. The framework compares mitigation strategies across data-, model-, and decision-level interventions, evaluates subgroup fairness using complementary classification and calibration criteria including equalized odds difference (EOD) and expected calibration error (ECE) disparity, and incorporates interpretability analyses based on SHapley Additive exPlanations (SHAP) and the calibration adjustment difference (CAD) to characterize changes in feature-contribution patterns and subgroup-specific probability adjustments after mitigation. The framework was applied to 126,819 ED encounters from MIMIC-IV-ED using measurements recorded within the first 2 h after arrival, and penalized logistic regression and random forest models were compared under reweighting, reduction, and multicalibration. Baseline AUROC values were 0.748 ± 0.028 for random forest and 0.746 ± 0.028 for penalized logistic regression. Reduction and multicalibration largely preserved discrimination performance, whereas reweighting was associated with reduced AUROC and AUPRC. Reweighting most clearly reduced EOD-based classification disparity, particularly for age, yielding reductions of 80.6% in random forest and 86.4% in penalized logistic regression. By contrast, multicalibration most consistently reduced ECE-based calibration disparity for sex and age but did not consistently improve EOD-based classification disparity. 
In the interpretability analyses, SHAP indicated that data- and model-level mitigation altered feature-contribution patterns, whereas CAD showed that decision-level mitigation produced subgroup-specific probability adjustments that varied in direction and magnitude across groups. These findings reveal trade-offs among discrimination performance, classification fairness, and calibration fairness, indicating that fairness mitigation should be guided by a clearly defined target fairness objective. Pre-deployment fairness auditing should therefore combine complementary fairness metrics with interpretability analyses to evaluate both subgroup-level outcomes and unintended changes in model behavior.
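
The calibration side of the audit can be sketched in the same spirit. The example below computes ECE with equal-width probability bins for each subgroup and reports the gap between the worst- and best-calibrated group. Both the binning scheme and the max-minus-min definition of "ECE disparity" are assumptions made for illustration; the abstract does not specify them, and the function names and toy data are hypothetical.

```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE with equal-width bins: weighted mean |observed rate - mean predicted prob|."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    total = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += (len(members) / total) * abs(observed - predicted)
    return ece


def ece_disparity(y_true, y_prob, groups, n_bins=10):
    """Assumed definition: largest minus smallest subgroup ECE."""
    by_group = {}
    for y, p, g in zip(y_true, y_prob, groups):
        ys, ps = by_group.setdefault(g, ([], []))
        ys.append(y)
        ps.append(p)
    per_group = {
        g: expected_calibration_error(ys, ps, n_bins)
        for g, (ys, ps) in by_group.items()
    }
    return max(per_group.values()) - min(per_group.values()), per_group


# Toy data (illustrative only): group "a" is under-confident, group "b" is perfectly calibrated.
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0]
toy_probs = [0.75, 0.25, 0.75, 0.25, 1.0, 0.0, 1.0, 0.0]
toy_subgroups = ["a", "a", "a", "a", "b", "b", "b", "b"]

disparity, per_group_ece = ece_disparity(toy_labels, toy_probs, toy_subgroups)
print(per_group_ece, disparity)
```

Computing this alongside an EOD-style metric on the same predictions is one way to see the trade-off the abstract describes: a post-processing step can shrink the calibration gap while leaving the classification gap unchanged.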

Cite This Study

Shin et al. studied this question.

synapsesocial.com/papers/6a2cf45afaef96ed7f056995 https://doi.org/10.3390/math14122085
