The application of machine learning (ML) is becoming increasingly common in production. However, many ML projects in production fail because of poor data quality. To improve data quality, the data must be preprocessed. Hundreds of data preprocessing (DPP) methods exist, and they are selected manually depending on use-case requirements. As a result, DPP is currently performed in an unstructured way and accounts for 80 % of an ML project's duration. We therefore introduce a structured DPP approach in which DPP methods are recommended based on production use-case requirements. The recommendations are derived by benchmarking the identified DPP methods according to ML model performance on five data sets. The approach is validated on two new use cases.
Frye et al. (Fri,) studied this question.
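The benchmarking idea described in the abstract, ranking candidate DPP methods by the performance of an ML model trained on the preprocessed data, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the three candidate methods, the 1-nearest-neighbor stand-in model, the synthetic toy data, and all function names (`min_max`, `z_score`, `benchmark`, `make_toy_data`) are hypothetical.

```python
import random
import statistics


def identity(train, test):
    """Baseline: no preprocessing."""
    return train, test


def min_max(train, test):
    """Scale each feature to [0, 1] using statistics of the training split only."""
    cols = list(zip(*train))
    lo = [min(c) for c in cols]
    span = [(max(c) - min(c)) or 1.0 for c in cols]

    def scale(rows):
        return [[(v - l) / s for v, l, s in zip(r, lo, span)] for r in rows]

    return scale(train), scale(test)


def z_score(train, test):
    """Standardize each feature using the training mean and standard deviation."""
    cols = list(zip(*train))
    mu = [statistics.fmean(c) for c in cols]
    sd = [statistics.pstdev(c) or 1.0 for c in cols]

    def scale(rows):
        return [[(v - m) / s for v, m, s in zip(r, mu, sd)] for r in rows]

    return scale(train), scale(test)


def predict_1nn(train_x, train_y, row):
    """Stand-in ML model (assumption): label of the nearest training row."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, t)) for t in train_x]
    return train_y[dists.index(min(dists))]


def benchmark(methods, X, y, train_share=0.7):
    """Return (method name, holdout accuracy) pairs, best method first."""
    cut = int(len(X) * train_share)
    results = []
    for name, method in methods.items():
        train_x, test_x = method(X[:cut], X[cut:])
        hits = sum(
            predict_1nn(train_x, y[:cut], row) == label
            for row, label in zip(test_x, y[cut:])
        )
        results.append((name, hits / len(test_x)))
    return sorted(results, key=lambda item: item[1], reverse=True)


def make_toy_data(n=200, seed=0):
    """Synthetic data (assumption): one informative feature, one large-range noise feature."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(label, 0.3), rng.uniform(0, 1000)])
        y.append(label)
    return X, y


# Hypothetical candidate set; a real study would cover far more DPP methods.
METHODS = {"none": identity, "min_max": min_max, "z_score": z_score}

X, y = make_toy_data()
ranking = benchmark(METHODS, X, y)
for name, accuracy in ranking:
    print(f"{name}: {accuracy:.2f}")
```

In this sketch the preprocessing statistics are fitted on the training split only, so the holdout accuracy is not influenced by the test rows. A recommendation step would then pick the top-ranked method for a use case whose requirements match the benchmarked data set.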