August 5, 2025 | Open Access

Comparison of Imputation Strategies for Incomplete Electronic Health Data

Key Points

MICE and MissForest were the top-performing imputation methods across most missingness scenarios.
Deep learning methods such as GAIN were unstable under MAR and MNAR, especially at higher missingness rates in electronic health records.
The study evaluated imputation quality using statistical measures across three clinical datasets.
The imputation strategy should be chosen with the downstream task in mind, because imputation quality does not always align with classification performance.

Abstract

Missing data is a persistent challenge in electronic health records (EHRs), often compromising data integrity and limiting the effectiveness of predictive models in healthcare. This study systematically evaluates five widely used imputation strategies—GAIN, MICE, Median, MissForest, and MIWAE—across three real-world clinical datasets under varying missingness mechanisms (MCAR, MAR, and MNAR) and missingness rates (10%–90%). We assessed imputation quality using multiple statistical measures and examined the relationship between imputation accuracy and downstream classification performance. Our results show that MICE and MissForest consistently outperform other methods across most scenarios, while deep learning-based approaches such as GAIN exhibit high instability under MAR and MNAR, particularly at higher missingness levels. Furthermore, imputation quality does not always align with classification performance, underscoring the need to consider task-specific goals when selecting imputation strategies. We also provide a practical framework summarizing method recommendations based on missingness type and rate, aiming to support robust data preprocessing decisions in clinical AI applications.
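
The three missingness mechanisms named above can be illustrated with a small simulation. The sketch below is not the study's procedure: the masking scheme (a higher drop probability above the median of a "key" column), the helper names `missing_mask`, `median_impute`, `rmse_on_missing`, and `demo`, and the synthetic Gaussian data are all assumptions made for illustration. It uses only the Python standard library, applies the Median baseline, and scores it by RMSE on the entries that were hidden.

```python
import math
import random
import statistics


def missing_mask(values, rate, mechanism, rng, driver=None):
    """Return booleans, True where a value should be hidden (hypothetical scheme).

    MCAR: every value is dropped with probability `rate`, independent of the data.
    MAR:  drop probability depends on a separate, fully observed `driver` column.
    MNAR: drop probability depends on the value that is itself being dropped.
    """
    if mechanism == "MCAR":
        return [rng.random() < rate for _ in values]
    key = driver if mechanism == "MAR" else values
    cut = statistics.median(key)
    # Rows whose key lies above the median are dropped more often; the two
    # probabilities average to `rate` when half of the rows are above the cut.
    p_high = min(1.0, 1.5 * rate)
    p_low = 2.0 * rate - p_high
    return [rng.random() < (p_high if k > cut else p_low) for k in key]


def median_impute(values, mask):
    """Fill hidden entries with the median of the entries that remain observed."""
    observed = [v for v, hidden in zip(values, mask) if not hidden]
    fill = statistics.median(observed)
    return [fill if hidden else v for v, hidden in zip(values, mask)]


def rmse_on_missing(truth, imputed, mask):
    """Root-mean-square error computed only over the hidden entries."""
    errs = [(t - i) ** 2 for t, i, hidden in zip(truth, imputed, mask) if hidden]
    return math.sqrt(sum(errs) / len(errs))


def demo(rate=0.5, n=2000, seed=0):
    """Compare median imputation error under the three mechanisms on toy data."""
    rng = random.Random(seed)
    driver = [rng.gauss(0, 1) for _ in range(n)]      # always observed
    target = [d + rng.gauss(0, 1) for d in driver]    # column that loses values
    results = {}
    for mechanism in ("MCAR", "MAR", "MNAR"):
        mask = missing_mask(target, rate, mechanism, rng, driver=driver)
        imputed = median_impute(target, mask)
        results[mechanism] = rmse_on_missing(target, imputed, mask)
    return results


if __name__ == "__main__":
    for mechanism, error in demo().items():
        print(f"{mechanism}: RMSE on hidden entries = {error:.3f}")
```

Under this toy MNAR scheme the observed values are skewed toward the low end, so the fill value is pulled away from the hidden entries it replaces; this is one way to see why a simple baseline can degrade when missingness depends on the data. Model-based methods such as MICE or MissForest would replace `median_impute` in the same loop.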
