Purpose: To evaluate the effectiveness and generalizability of bias mitigation methods in glaucoma progression prediction models across a multicenter electronic health records (EHR) repository, and to propose a novel evaluation metric that balances performance and fairness in clinical artificial intelligence (AI).
Design: A cohort study.
Participants: A total of 50 656 glaucoma patients drawn from seven participating institutions in the SOURCE consortium, a harmonized EHR repository spanning ophthalmology departments in the United States.
Methods: We trained five model architectures (e.g., XGBoost, neural networks, and transformers) to predict progression to surgery. Each model was evaluated with and without five bias mitigation methods spanning preprocessing, inprocessing, and postprocessing. Performance and fairness were assessed on 1 internal and 2 external test sets. We introduced FairOdds-AUC, a composite metric that adjusts the area under the receiver operating characteristic curve (AUROC) by equalized odds gaps across sex and race/ethnicity. The FairOdds-AUC metric was implemented in Python and is available as an open-source package for reproducibility and future use.
Main Outcome Measures: AUROC, equalized odds for sex and race/ethnicity, and FairOdds-AUC.
Results: Inprocessing methods, particularly inverse propensity weighting (IPW) and the adversarial fairness classifier, achieved more favorable fairness-performance tradeoffs than baseline and other mitigation approaches across all evaluation sets. For example, on the internal test set, IPW improved FairOdds-AUC from 0.562 (95% confidence interval 0.540, 0.581) to 0.600 (0.575, 0.629) for the transformer model and from 0.556 (0.534, 0.577) to 0.5922 (0.53, 0.61919) for a fully connected network, while maintaining essentially the same discrimination. The adversarial fairness classifier achieved the highest FairOdds-AUC in several settings (up to 0.613 [0.595, 0.629] for the deep learning fully connected network), with substantial reductions in the equalized odds difference for sex. Postprocessing and preprocessing bias mitigation strategies yielded more variable FairOdds-AUC changes (-0.009 to +0.021) and showed weaker generalizability across external sites. FairOdds-AUC consistently reflected the balance between AUROC and equalized odds, with the optimal mitigation strategy depending on fairness-utility priorities.
Conclusions: Across a large, diverse glaucoma cohort, inprocessing bias mitigation methods promoted fairness most consistently across evaluation sites. FairOdds-AUC offers a flexible, interpretable way to evaluate clinical AI where fairness matters. Our findings support incorporating fairness evaluations and fairness-aware model training into future ophthalmic AI applications.
Financial Disclosures: Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
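The abstract describes FairOdds-AUC only as a composite that adjusts AUROC by equalized odds gaps; it does not state the formula. The sketch below is therefore an illustration of the general idea, not the published metric or its open-source package. It computes AUROC, computes an equalized odds difference per sensitive attribute (the larger of the between-group true-positive-rate and false-positive-rate gaps), and subtracts a weighted mean of those gaps. The function name `fairness_adjusted_auroc`, the parameters `threshold` and `weight`, the subtractive form, and the toy data are all assumptions made for this example.

```python
from collections import defaultdict


def auroc(y_true, scores):
    """AUROC via the rank-sum (Mann-Whitney U) identity; tied scores share an average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def equalized_odds_difference(y_true, y_pred, groups):
    """Larger of the between-group TPR gap and FPR gap (0 means no gap).

    Assumes every group contains at least one positive and one negative label.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for y, p, g in zip(y_true, y_pred, groups):
        key = ("tp" if p else "fn") if y else ("fp" if p else "tn")
        counts[g][key] += 1
    tprs = [c["tp"] / (c["tp"] + c["fn"]) for c in counts.values()]
    fprs = [c["fp"] / (c["fp"] + c["tn"]) for c in counts.values()]
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))


def fairness_adjusted_auroc(y_true, scores, sensitive, threshold=0.5, weight=1.0):
    """Hypothetical composite: AUROC minus `weight` times the mean equalized odds gap.

    This is NOT the published FairOdds-AUC formula, which the abstract does not give.
    `sensitive` maps an attribute name to a list of group labels, one per sample.
    """
    y_pred = [int(s >= threshold) for s in scores]
    gaps = [equalized_odds_difference(y_true, y_pred, g) for g in sensitive.values()]
    return auroc(y_true, scores) - weight * sum(gaps) / len(gaps)


# Toy data, invented purely for illustration (not from the study).
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0]
toy_scores = [0.9, 0.2, 0.8, 0.6, 0.4, 0.1, 0.7, 0.3]
toy_sensitive = {
    "sex": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "race_ethnicity": ["A", "A", "B", "B", "A", "A", "B", "B"],
}
print(fairness_adjusted_auroc(toy_labels, toy_scores, toy_sensitive))
```

Under this assumed form, a model with strong discrimination but large between-group error gaps scores lower than its raw AUROC suggests, and `weight` lets the evaluator express how heavily fairness should count against utility.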
Zhao et al. studied this question.
synapsesocial.com/papers/6a0fa0352badbc352afe72e2 — DOI: https://doi.org/10.1016/j.xops.2026.101119
Yihan Zhao
Stanford University
Rohith Ravindranath
Smith-Kettlewell Eye Research Institute
Tina Hernandez‐Boussard
Preventive Cardiology
Ophthalmology Science
Stanford University
University of Michigan
Smith-Kettlewell Eye Research Institute