Data preprocessing and feature engineering play key roles in data mining, as they strongly affect the accuracy, reproducibility, and interpretability of analytical results. This review analyses state-of-the-art techniques and tools for preparing and transforming input data for mining tasks across diverse application scenarios. It covers basic preprocessing techniques, including data cleaning, normalisation, and encoding, as well as more sophisticated approaches to feature construction, feature selection, and dimensionality reduction. Both manual and automated methods are considered, with emphasis on their integration into reproducible, large-scale pipelines built on modern libraries. We also discuss methods for assessing the effects of preprocessing on model precision, stability, and bias–variance trade-offs, and for monitoring pipeline integrity when operating environments vary. We focus on emerging issues of scalability, fairness, and interpretability, and on future directions involving adaptive preprocessing and automation guided by ethically sound design philosophies. This work aims to benefit both professionals and researchers by highlighting best practices while acknowledging open research questions and opportunities for innovation.
Koukaras et al. (Thu,) studied this question.