What question did this study set out to answer?

The aim is to optimize training datasets for machine learning by assessing data quality and preventing common pitfalls like overfitting.

April 24, 2026 · Open Access

DataAiPrep: A comprehensive machine learning data quality assessment tool for training dataset optimization

Key Points

The aim is to optimize training datasets for machine learning by assessing data quality and preventing common pitfalls like overfitting.
Developed a platform implementing statistical analysis and multi-method feature selection.
Integrated multi-layered leakage detection and scalable processing with Dask.
Validated the tool on benchmark datasets with known data quality issues.
Achieved high accuracy in detecting data quality issues: Missing Data Patterns (98.5%), Data Leakage (95.8%).
Demonstrated an average F1-score improvement of 23.1% in classification tasks after applying preprocessing recommendations.
Achieved a 23.4% reduction in RMSE for regression tasks through improved dataset quality.
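
One layer of the leakage detection listed above, exact train-test contamination, can be illustrated with row hashing. The following is a minimal sketch under stated assumptions, not DataAiPrep's actual implementation: the function names `row_hash` and `find_contaminated_rows`, the choice of SHA-256, and the sample rows are all hypothetical.

```python
import hashlib


def row_hash(row):
    """Return a stable SHA-256 digest of a row's values (hypothetical helper)."""
    # Join the repr() of each value with a unit-separator character so that
    # ("ab", "c") and ("a", "bc") do not collapse to the same payload.
    payload = "\x1f".join(repr(value) for value in row)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def find_contaminated_rows(train_rows, test_rows):
    """Return indices of test rows whose exact values also occur in train."""
    train_hashes = {row_hash(row) for row in train_rows}
    return [i for i, row in enumerate(test_rows) if row_hash(row) in train_hashes]


# Illustrative data only: the first test row duplicates a training row.
train = [(5.1, 3.5, "a"), (4.9, 3.0, "b"), (6.2, 2.9, "c")]
test = [(4.9, 3.0, "b"), (7.0, 3.2, "d")]

leaked = find_contaminated_rows(train, test)
print(leaked)
```

Hashing each row once keeps the comparison roughly linear in the number of rows, but it only catches exact matches, which is presumably why the tool pairs it with similarity-based near-duplicate detection.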

Abstract

This paper presents DataAiPrep, a comprehensive machine learning data quality assessment platform designed to optimize training datasets and prevent common issues that lead to underfitting, overfitting, and poor model performance. The software implements advanced statistical analysis, automated multi-method feature selection with consensus voting, SHapley Additive exPlanations (SHAP)-based explainability, multi-layered leakage detection (train–test contamination via row hashing, cosine similarity-based near-duplicate identification, entity-aware group leakage, and target correlation analysis), and scalable processing via Dask integration. Key technical contributions include an ensemble-driven feature selection framework that aggregates Boruta, RFECV, LASSO/Elastic Net, and mRMR through a consensus voting mechanism, a hierarchical leakage detection pipeline, and ensemble outlier detection using IQR, Z-score, Isolation Forest, Local Outlier Factor, and DBSCAN with consensus scoring. Empirical validation across 10 purpose-built benchmark datasets with known ground-truth issues demonstrated strong detection accuracy: Missing Data Patterns (98.5%), Data Leakage (95.8%), Outlier Detection (93.7%), Feature Redundancy (96.3%), Distribution Anomalies (94.2%), and High Cardinality Issues (99.1%). A comparative evaluation across 30 datasets showed that applying DataAiPrep’s preprocessing recommendations led to an average improvement of 23.1% in F1-score for classification tasks and a 23.4% reduction in RMSE for regression tasks.
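
The consensus voting idea in the abstract can be sketched in a few lines. This is an illustration under assumptions, not the paper's implementation: the abstract does not state the voting threshold, so `min_votes` is a hypothetical parameter, `consensus_select` is a hypothetical name, and the per-method feature lists are made-up stand-ins for what Boruta, RFECV, LASSO, and mRMR might each return.

```python
from collections import Counter


def consensus_select(selections, min_votes):
    """Keep features chosen by at least `min_votes` of the selection methods."""
    votes = Counter()
    for chosen in selections.values():
        # set() ensures each method contributes at most one vote per feature.
        votes.update(set(chosen))
    return sorted(name for name, count in votes.items() if count >= min_votes)


# Hypothetical outputs of four selectors; no real selector was run here.
selections = {
    "boruta": ["age", "income", "tenure"],
    "rfecv": ["age", "income", "region"],
    "lasso": ["age", "tenure"],
    "mrmr": ["income", "age", "score"],
}

# Assumed threshold: a feature must be picked by at least 3 of the 4 methods.
selected = consensus_select(selections, min_votes=3)
print(selected)
```

Raising `min_votes` trades recall for agreement: a stricter threshold keeps only features that several methods with different biases endorse. The same voting pattern could in principle be applied to the outlier flags from the ensemble of detectors the abstract lists.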

Cite This Study

Massaoudi et al. studied this question.

synapsesocial.com/papers/69eb099a553a5433e34b3ff5 https://doi.org/10.1016/j.softx.2026.102662
