Race and ethnicity are frequently missing in electronic health records (EHRs), and excluding records with missing values can bias pediatric research and disparity estimates. Imputing missing values may reduce this bias, but performance can vary across methods and subgroups, especially for smaller or heterogeneous categories. We compared four approaches for imputing missing race (Black/White/Other) and ethnicity (Hispanic/Non-Hispanic) from individual- and census-tract-level sociodemographic measures: logistic regression, random forest, k-nearest neighbors (KNN), and multiple imputation by chained equations (MICE). We analyzed data from 5309 children (<19 years) treated for Staphylococcus aureus infections at two pediatric hospitals in metropolitan Atlanta (2002–2015). Performance was evaluated on a held-out test set (n = 554) using accuracy and weighted F1. For race, logistic regression and KNN performed best (accuracy/weighted F1: 0.838/0.822 and 0.839/0.823, respectively), followed by random forest (0.798/0.787); MICE performed worst (0.736/0.743). For ethnicity, KNN achieved the highest accuracy (0.912) and random forest the highest weighted F1 (0.895); the full accuracy/weighted F1 results were 0.901/0.876 for logistic regression, 0.904/0.895 for random forest, 0.912/0.887 for KNN, and 0.866/0.864 for MICE. Performance was lowest for Hispanic ethnicity and the “Other” race category, consistent with class imbalance. Imputation performance depends on both the demographic attribute and the modeling approach, so subgroup-specific evaluation is essential when imputing race and ethnicity in pediatric EHR research.
Li et al. studied this question.
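The two evaluation metrics named in the abstract, accuracy and weighted F1, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names and the ten toy labels are hypothetical and chosen only to show how a majority class can dominate both summary metrics while a small category such as "Other" scores much lower. Weighted F1 is assumed here to mean per-class F1 averaged with weights proportional to each class's share of the true labels, which is the common definition.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of records whose imputed label equals the observed label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred):
    """F1 score for each class that appears in y_true."""
    scores = {}
    for label in Counter(y_true):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's true count."""
    support = Counter(y_true)
    f1 = per_class_f1(y_true, y_pred)
    return sum(support[c] * f1[c] for c in support) / len(y_true)


# Hypothetical toy labels (not study data): an imbalanced three-class problem.
y_true = ["White"] * 5 + ["Black"] * 3 + ["Other"] * 2
y_pred = ["White"] * 5 + ["Black", "Black", "White"] + ["White", "Other"]

print("accuracy:", round(accuracy(y_true, y_pred), 3))
print("weighted F1:", round(weighted_f1(y_true, y_pred), 3))
for label, score in per_class_f1(y_true, y_pred).items():
    print(f"F1[{label}]:", round(score, 3))
```

On this toy input the overall accuracy and weighted F1 are both close to 0.8, while the F1 for the smallest class is noticeably lower. That is the kind of gap that only a per-class breakdown reveals, and it is the reason for reporting subgroup-specific results alongside the summary metrics.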