This study evaluates how missing data imputation methods affect classification performance in high-dimensional datasets. Simulated datasets (n = 150; p = 500 and p = 1000) with different correlation structures and missing data rates (10%–50%) were analysed to compare single imputation methods (mean, median, random, K-nearest neighbours (KNN) and singular value decomposition (SVD)) with multiple imputation techniques: missing value imputation with random forests (I-RF), multivariate imputation by chained equations with classification and regression trees (MICE-CART), direct use of regularized regression (DURR) and indirect use of regularized regression (IURR). Classification was performed with an extreme learning machine (ELM) and evaluated using the area under the receiver operating characteristic curve (AUC) and balanced accuracy. The advanced methods (I-RF, MICE-CART, DURR, IURR) closely matched complete-data performance at low missing rates (10%–20%), while DURR and IURR outperformed the other methods at higher missing rates (30%–50%). An application to a real breast cancer gene expression dataset supports these findings: multiple imputation methods, particularly DURR and IURR, yielded the most reliable classification outcomes.
Varol et al. studied this question.
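
The comparison described in the abstract (simulate data, remove values at a given rate, impute, classify, then score with AUC and balanced accuracy) can be illustrated with a minimal pure-Python sketch. It is not the study's implementation and makes several assumptions the abstract does not state: values are removed completely at random, only mean or median imputation is shown, a nearest-centroid rule stands in for the ELM classifier, and the data dimensions are reduced for speed. All function names and parameters (`simulate`, `make_mcar`, `impute_columns`, `centroid_scores`, `run_demo`, `n_informative`, `shift`) are hypothetical.

```python
import random
import statistics


def simulate(n, p, n_informative, shift, rng):
    """Two balanced classes; the first n_informative features are shifted in class 1."""
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(shift * label if j < n_informative else 0.0, 1.0)
          for j in range(p)] for label in y]
    return X, y


def make_mcar(X, rate, rng):
    """Copy X, replacing each entry by None with probability `rate` (assumed MCAR)."""
    return [[None if rng.random() < rate else v for v in row] for row in X]


def impute_columns(X, stat=statistics.mean):
    """Single imputation: fill each column's gaps with a statistic of its observed values."""
    fills = []
    for j in range(len(X[0])):
        observed = [row[j] for row in X if row[j] is not None]
        fills.append(stat(observed) if observed else 0.0)  # 0.0 fallback is an assumption
    return [[fills[j] if v is None else v for j, v in enumerate(row)] for row in X]


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary 0/1 labels."""
    tp = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 1)
    tn = sum(1 for t, q in zip(y_true, y_pred) if t == 0 and q == 0)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)


def auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def centroid_scores(X_train, y_train, X_test):
    """Stand-in classifier (not ELM): higher score means closer to the class-1 centroid."""
    p = len(X_train[0])

    def centroid(label):
        rows = [r for r, t in zip(X_train, y_train) if t == label]
        return [sum(r[j] for r in rows) / len(rows) for j in range(p)]

    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    c0, c1 = centroid(0), centroid(1)
    return [sqdist(r, c0) - sqdist(r, c1) for r in X_test]


def run_demo(rate, seed=0, n=150, p=50):
    """Return (AUC, balanced accuracy) on the held-out half for one missing rate."""
    rng = random.Random(seed)
    X, y = simulate(n, p, n_informative=5, shift=1.0, rng=rng)
    # Simplification: imputation uses all rows, before the train/test split.
    X_used = impute_columns(make_mcar(X, rate, rng)) if rate > 0 else X
    half = n // 2
    scores = centroid_scores(X_used[:half], y[:half], X_used[half:])
    preds = [1 if s > 0 else 0 for s in scores]
    return auc(y[half:], scores), balanced_accuracy(y[half:], preds)


if __name__ == "__main__":
    for rate in (0.0, 0.1, 0.3, 0.5):
        a, b = run_demo(rate)
        print(f"missing rate {rate:.0%}: AUC = {a:.3f}, balanced accuracy = {b:.3f}")
```

Calling `run_demo(0.0)` gives a complete-data baseline against which the imputed runs can be compared, mirroring the comparison in the abstract. Passing `stat=statistics.median` to `impute_columns` switches to median imputation. The other methods named in the abstract (KNN, SVD, I-RF, MICE-CART, DURR, IURR) would need dedicated libraries and are not sketched here.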