What question did this study set out to answer?

This study aims to assess different methods for imputing missing race and ethnicity data in electronic health records to reduce bias in pediatric research.

March 18, 2026 · Open Access

Handling Missing Race and Ethnicity in an EHR-Based Study Through Integration of Individual Measures and Neighborhood Sociodemographic and Socioeconomic Measures

Key Points

This study aims to assess different methods for imputing missing race and ethnicity data in electronic health records to reduce bias in pediatric research.
Compared four imputation methods: logistic regression, random forest, k-nearest neighbors (KNN), and multiple imputation by chained equations (MICE).
Analyzed data from 5309 children treated for Staphylococcus aureus infections in metropolitan Atlanta from 2002 to 2015.
Evaluated performance using accuracy and weighted F1 on a held-out test set (n = 554).
Logistic regression and KNN performed best for race imputation, achieving accuracies of 0.838 and 0.839, respectively.
KNN showed the highest accuracy for ethnicity (0.912), while random forest achieved the highest weighted F1 (0.895).
Imputation performance varied by demographic attribute and was lowest for Hispanic ethnicity and the 'Other' race category, consistent with class imbalance.

Abstract

Race and ethnicity are frequently missing in electronic health records (EHRs), and excluding records with missing values can bias pediatric research and disparity estimates. Imputing missing values may reduce this bias, but performance can vary across methods and subgroups, especially for smaller or heterogeneous categories. We compared four approaches (logistic regression, random forest, k-nearest neighbors [KNN], and multiple imputation by chained equations [MICE]) to impute missing race (Black/White/Other) and ethnicity (Hispanic/Non-Hispanic) using individual- and census-tract-level sociodemographic measures. We analyzed 5309 children (<19 years) treated for Staphylococcus aureus infections at two pediatric hospitals in metropolitan Atlanta (2002–2015). Performance was evaluated on a held-out test set (n = 554) using accuracy and weighted F1. For race, logistic regression and KNN performed best (accuracy/weighted F1: 0.838/0.822 and 0.839/0.823, respectively), followed by random forest (0.798/0.787); MICE performed worst (0.736/0.743). For ethnicity, KNN achieved the highest accuracy (0.912) and random forest the highest weighted F1 (0.895); the full accuracy/weighted F1 results were logistic regression 0.901/0.876, random forest 0.904/0.895, KNN 0.912/0.887, and MICE 0.866/0.864. Performance was lowest for Hispanic ethnicity and the “Other” race category, consistent with class imbalance. Imputation performance depends on the demographic attribute and modeling approach; subgroup-specific evaluation is essential when imputing race and ethnicity in pediatric EHR research.
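
The two evaluation metrics named in the abstract, accuracy and weighted F1, can be illustrated with a minimal Python sketch. The labels below are hypothetical toy data, not results from the study, and the helper names (`per_class_f1`, `accuracy`, `weighted_f1`) are assumptions introduced only for illustration. The sketch shows why subgroup-specific evaluation matters: a small class can be imputed poorly while the overall scores still look strong.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: (f1, support)} for every label present in y_true."""
    support = Counter(y_true)
    scores = {}
    for label in sorted(support):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        f1 = 2 * tp / denom if denom else 0.0
        scores[label] = (f1, support[label])
    return scores


def accuracy(y_true, y_pred):
    """Fraction of records whose imputed label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    scores = per_class_f1(y_true, y_pred)
    return sum(f1 * n for f1, n in scores.values()) / len(y_true)


# Hypothetical toy labels (NOT study data): an imbalanced three-class problem
# in which the single "Other" record is imputed as "White".
y_true = ["Black"] * 6 + ["White"] * 3 + ["Other"] * 1
y_pred = ["Black"] * 6 + ["White"] * 3 + ["White"] * 1

print("accuracy:", round(accuracy(y_true, y_pred), 3))
print("weighted F1:", round(weighted_f1(y_true, y_pred), 3))
for label, (f1, n) in per_class_f1(y_true, y_pred).items():
    print(f"{label}: F1={f1:.3f}, support={n}")
```

In this toy case the accuracy is 0.9 and the weighted F1 is about 0.857, yet the F1 for the "Other" class is 0. Aggregate metrics alone would hide that failure, which is the pattern the abstract attributes to class imbalance.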
