This paper presents DataAiPrep, a machine learning data quality assessment platform designed to optimize training datasets and prevent common issues that lead to underfitting, overfitting, and poor model performance. The software implements statistical analysis, automated multi-method feature selection with consensus voting, SHapley Additive exPlanations (SHAP)-based explainability, multi-layered leakage detection (train–test contamination via row hashing, cosine similarity-based near-duplicate identification, entity-aware group leakage, and target correlation analysis), and scalable processing via Dask integration. Key technical contributions include: an ensemble-driven feature selection framework that aggregates Boruta, RFECV, LASSO/Elastic Net, and mRMR through a consensus voting mechanism; a hierarchical leakage detection pipeline; and ensemble outlier detection that combines IQR, Z-score, Isolation Forest, Local Outlier Factor, and DBSCAN with consensus scoring. Empirical validation on 10 purpose-built benchmark datasets with known ground-truth issues demonstrated strong detection accuracy: Missing Data Patterns (98.5%), Data Leakage (95.8%), Outlier Detection (93.7%), Feature Redundancy (96.3%), Distribution Anomalies (94.2%), and High Cardinality Issues (99.1%). A comparative evaluation across 30 datasets showed that applying DataAiPrep’s preprocessing recommendations led to an average improvement of 23.1% in F1-score for classification tasks and an average reduction of 23.4% in RMSE for regression tasks.
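To make the first leakage layer concrete, the following is a minimal sketch of exact train–test contamination detection via row hashing. It is an illustration under stated assumptions, not DataAiPrep's actual implementation: the function name `find_contaminated_rows` is hypothetical, and the sketch uses pandas' `hash_pandas_object` as one possible hashing routine. Only exact duplicates are caught; near-duplicates, group leakage, and target correlation would need the other layers named above.

```python
import pandas as pd
from pandas.util import hash_pandas_object


def find_contaminated_rows(train: pd.DataFrame, test: pd.DataFrame) -> pd.Index:
    """Return index labels of test rows whose content also appears in train.

    Hypothetical helper for illustration; not part of any published API.
    """
    # Compare only shared columns, in a fixed order, so that identical rows
    # produce identical hashes in both frames.
    cols = sorted(set(train.columns) & set(test.columns))

    # index=False hashes the row content only and ignores index labels.
    train_hashes = hash_pandas_object(train[cols], index=False)
    test_hashes = hash_pandas_object(test[cols], index=False)

    # A test row is flagged when its 64-bit hash also occurs in train.
    # Hash collisions are possible in principle, so flagged rows could be
    # confirmed with a direct comparison if exactness matters.
    mask = test_hashes.isin(train_hashes).to_numpy()
    return test.index[mask]


if __name__ == "__main__":
    train = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
    test = pd.DataFrame({"a": [2, 9], "b": ["y", "q"]}, index=[10, 11])
    print(list(find_contaminated_rows(train, test)))
```

Hashing each row once keeps the check roughly linear in the number of rows, which is one reason a hash-based pass is a natural first stage before more expensive similarity checks. Note that this sketch treats values with different dtypes (for example, integer `2` and float `2.0`) as different rows.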
Massaoudi et al. (Wed,) studied this question.