October 10, 2025 | Open Access

Comparative analysis of imputation methods in machine learning models

Key points

Imputation methods significantly affect the performance of machine learning models such as Random Forest and SVM.
Mean/Median Imputation and K-Nearest Neighbors show varied impacts on predictive accuracy in machine learning applications.
The selection of imputation techniques should align with dataset attributes for optimal model fit and stability.
Careful experimentation is essential to understand the consequences of different imputation methods on model performance.

Abstract

Missing data is a prevalent issue in machine learning and data analysis that affects the credibility and performance of predictive models. This article provides a comprehensive study of missing data, its types, its consequences, and popular imputation methods. Using real datasets, we compare the performance of Mean/Median Imputation, K-Nearest Neighbors (KNN) Imputation, Multiple Imputation, Regression Imputation, and Hot Deck Imputation. Furthermore, we study how these imputation techniques affect machine learning models such as Random Forest, Gradient Boosting Machines (GBM), and Support Vector Machines (SVM). Our study emphasizes the need for careful experimentation and model-specific investigation when handling missing data: the imputation technique should be selected to suit both the attributes of the dataset and the machine learning model being trained. Lastly, our findings underscore the importance of tailored imputation strategies in enhancing model fit and ensuring stable analytical results.
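To make the contrast between two of the compared methods concrete, the following is a minimal sketch of mean imputation and KNN imputation in plain Python. It is an illustration only, not the implementation used in the study: the function names `mean_impute` and `knn_impute`, the toy `rows` data, and the distance rule (root-mean-square difference over the columns observed in both rows) are all assumptions made for this example.

```python
import math
from statistics import mean


def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    col_means = [
        mean(r[j] for r in rows if r[j] is not None) for j in range(n_cols)
    ]
    return [
        [col_means[j] if v is None else v for j, v in enumerate(r)]
        for r in rows
    ]


def knn_impute(rows, k=2):
    """Replace each None with the column mean over the k nearest donor rows."""

    def distance(a, b):
        # Assumed rule: RMS difference over columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    result = []
    for i, row in enumerate(rows):
        new_row = list(row)
        for j, value in enumerate(row):
            if value is None:
                # Donors are other rows that have column j observed.
                donors = sorted(
                    (distance(row, other), other[j])
                    for m, other in enumerate(rows)
                    if m != i and other[j] is not None
                )[:k]
                new_row[j] = mean(val for _, val in donors)
        result.append(new_row)
    return result


# Hypothetical toy data: the last row is an outlier, the third has a gap.
rows = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, None],
    [10.0, 100.0],
]

mean_filled = mean_impute(rows)      # gap filled with the column mean, 130 / 3
knn_filled = knn_impute(rows, k=2)   # gap filled with 15.0, from the two nearest rows
```

On this toy data the outlier row pulls the mean-imputed value to roughly 43.3, while the KNN estimate of 15.0 reflects only the two rows closest to the incomplete one. This is one simple way in which the choice of imputation method can interact with dataset attributes, as the abstract argues.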
