Missing data is a pervasive problem in machine learning and data analysis that undermines the reliability and performance of predictive models. This article surveys missing data, its types, its consequences, and widely used imputation methods. Using real datasets, we compare Mean/Median Imputation, K-Nearest Neighbors (KNN) Imputation, Multiple Imputation, Regression Imputation, and Hot Deck Imputation. We then examine how these imputation techniques affect machine learning models such as Random Forest, Gradient Boosting Machines (GBM), and Support Vector Machines (SVM). Our study emphasizes the need for careful experimentation and model-specific investigation when handling missing data: the imputation technique should be selected according to the attributes of the dataset and the machine learning model being trained. Finally, our findings underscore the importance of tailored imputation strategies for improving model fit and keeping analytical results stable.
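To make two of the compared techniques concrete, the following is a minimal sketch of Mean/Median Imputation and KNN Imputation in plain Python. It is not the implementation used in the study: the function names `simple_impute` and `knn_impute`, the use of `None` to mark missing values, the distance definition, and the toy data are all assumptions made for illustration.

```python
import math
import statistics


def simple_impute(rows, strategy="mean"):
    """Fill each None with the mean or median of the observed values in its column."""
    agg = statistics.mean if strategy == "mean" else statistics.median
    n_cols = len(rows[0])
    # One fill value per column, computed from observed entries only.
    fills = [agg([r[j] for r in rows if r[j] is not None]) for j in range(n_cols)]
    return [[fills[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def knn_impute(rows, k=2):
    """Fill each None with the average of that column over the k nearest rows.

    Assumption: distance is the root mean squared difference over the columns
    observed in both rows. Rows that share no observed column are treated as
    infinitely far away. At least one other row must have the column observed.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    result = [list(r) for r in rows]  # copy so the input is left untouched
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: every other row that has column j observed.
            donors = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            donors.sort(key=lambda d: d[0])
            result[i][j] = statistics.mean(v for _, v in donors[:k])
    return result


# Hypothetical toy data: the second row is missing its second feature.
data = [
    [1.0, 2.0],
    [2.0, None],
    [3.0, 6.0],
    [10.0, 20.0],
]

mean_filled = simple_impute(data, "mean")
median_filled = simple_impute(data, "median")
knn_filled = knn_impute(data, k=2)
```

On this toy data the three strategies fill the same gap with different values, because the mean is pulled toward the outlying row while the median and the two nearest neighbors are not. This is the kind of sensitivity to dataset attributes that motivates comparing methods rather than fixing one in advance.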
Melnyk et al. (Wed,) studied this question.