A 4-variable machine-learning algorithm accurately identified G6PD rs1050828 variant carriers among African American men (AUC 0.87), with external validation confirming high discrimination (AUC 0.94).
Observational (n=5,981)
Yes
Can a machine-learning algorithm using routine EHR data accurately identify African American men who are carriers of the G6PD rs1050828 variant?
A machine-learning algorithm using four routine EHR variables can accurately identify African American men carrying the G6PD rs1050828 variant, enabling targeted screening to prevent diabetes underdiagnosis.
Effect estimate: AUC 0.87
Introduction and Objective: The African-specific G6PD variant rs1050828-T (p.Val68Met) results in G6PD deficiency, which lowers HbA1c independently of glycemia, risking diabetes underdiagnosis and undertreatment in African American (AA) individuals. We aimed to develop a machine-learning algorithm that uses routinely collected EHR data to flag likely G6PD variant carriers among AA men, enabling scalable targeted confirmatory testing and individualized HbA1c interpretation. Methods: We analyzed 5,981 AA men from the All of Us (AOU) Research Program (11% rs1050828 hemizygotes). Predictors included HbA1c, outpatient random plasma glucose, hemoglobin glycation index, demographics, other laboratory results, antidiabetic medications, and comorbidities recorded within 2 years before AOU enrollment. A random-forest classifier was tuned via cross-validation on a 75% training set and evaluated on the held-out test set. External validation was performed in the UK Biobank (UKB). Results: The saturated model (62 variables) demonstrated strong discrimination for identifying rs1050828 hemizygotes (AUC 0.89). A parsimonious model using the top 4 predictors (HbA1c, glucose, red cell distribution width, and age) retained high performance (AUC 0.87), offering a scalable clinical option. When translated into a 1-20 clinical risk score, the parsimonious model showed distinct risk stratification; a risk score threshold of ≥18 yielded robust prediction (sensitivity 93%, specificity 96%, PPV 75%, NPV 99%). External validation in the UKB showed higher discrimination (AUC 0.94) and better performance (sensitivity 94%, specificity 99%, PPV 94%, NPV 99%). Conclusion: This high-yield algorithm provides a scalable solution for targeted genotyping where universal screening is infeasible. By leveraging 4 routinely collected variables, this approach enables precision screening that resolves systematic underdiagnosis and fosters more equitable diabetes management for AA men.
Disclosure Q. Xue: None. P. Li: None. Z. Li: None. Y. Shao: None. E. Mitchell: None. L.S. Phillips: Research Support; Ended; Janssen Pharmaceuticals, Inc. Research Support; Current; Boehringer Ingelheim International GmbH. Other - Diasyst, Inc. is a startup. It produces software for providers, aimed to improve diabetes care. I am cofounder, Medical Director, Board member, and stockholder. Its debts are much larger than its revenue, and in the past year, I received no income from it.; Current; Diasyst Inc. H. Shao: None.
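The predictive values reported in the abstract for the risk score threshold of ≥18 depend on how common the variant is in the screened population. As a rough consistency check, and not a reproduction of the authors' analysis, the sketch below applies Bayes' rule to the reported sensitivity (93%) and specificity (96%). It assumes that the 11% hemizygote prevalence of the full AOU cohort also holds in the held-out test set, which the abstract does not state. The function name `predictive_values` is hypothetical.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) implied by sensitivity, specificity and prevalence."""
    # Expected fractions of the screened population in each confusion-matrix cell.
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Reported AOU operating point; the 11% prevalence is an assumption
# borrowed from the full cohort, not a stated test-set figure.
ppv, npv = predictive_values(sensitivity=0.93, specificity=0.96, prevalence=0.11)
print(f"PPV {ppv:.0%}, NPV {npv:.0%}")  # prints "PPV 74%, NPV 99%"
```

Under that assumption the implied values (PPV about 74%, NPV about 99%) are close to the reported PPV of 75% and NPV of 99%; the small gap is compatible with rounding of the inputs or a slightly different prevalence in the test split.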
Xue et al. (Fri,) conducted an observational study in G6PD deficiency (n=5,981). A machine-learning algorithm was evaluated on identification of rs1050828 hemizygotes (AUC 0.87).