What question did this study set out to answer?

This research aims to develop a fuzzy inference framework for data preprocessing in machine learning that scales corrective action to error severity instead of treating each value as simply valid or invalid.

May 21, 2026 · Open Access

Fuzzy Graded Preprocessing for Robust Machine Learning: A Three-Stage Mamdani Framework with Interpretable Audit Trails

Key Points

Introduces GDEDC, a three-stage Mamdani-type fuzzy inference framework.
Evaluates the framework on five UCI datasets and the Pima Indians diabetes dataset, using five classifiers across six noise levels (5–30%).
Compares GDEDC against five baseline methods, including MICE.
GDEDC matches KNN Imputation and MICE at low noise and surpasses both at noise levels ≥20%, achieving the best Friedman rank at 20–30% noise on noise-sensitive classifiers.
On the Pima dataset, GDEDC outperforms IQR by +2.97% (p < 0.001, d = 0.684).
Ablation analysis reveals sigmoid-based proportional correction contributes +2.02 pp to performance.

Abstract

Data preprocessing methods for machine learning overwhelmingly rely on binary logic—a value is either valid or invalid—and the corrective action does not scale with error severity. This paper introduces GDEDC, a Mamdani-type fuzzy inference framework that replaces binary preprocessing with graded error detection and proportional correction. Operating in three stages—fuzzy anomaly scoring, nine-rule Mamdani FIS classification, and sigmoid-weighted imputation—the framework corrects each value in proportion to its estimated error severity while retaining 100% of observations and producing a human-readable audit trail. We evaluate GDEDC on five UCI datasets and the Pima Indians diabetes dataset with five classifiers across six noise levels (5–30%), comparing against five baselines including MICE. Under leakage-free conditions, deletion-based methods consistently underperform raw data, while correction-based methods (GDEDC, KNN Imputation, MICE) deliver significant improvements. GDEDC matches KNN Imputation and MICE at low noise and surpasses both at ≥20% noise: on noise-sensitive classifiers, GDEDC achieves the best Friedman rank at 20–30% noise. Real-world validation on the Pima dataset confirms generalizability, with GDEDC outperforming IQR by +2.97% (p < 0.001, d = 0.684). Ablation analysis shows that sigmoid-based proportional correction is the primary contributor (+2.02 pp), and the full pipeline outperforms every ablated variant at 10–20% noise.
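The idea of correcting each value in proportion to its estimated error severity can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the robust z-score scoring, the ramp breakpoints, the sigmoid parameters, and the median used as the replacement estimate are all assumptions, and the function names `anomaly_score` and `graded_correct` are hypothetical. The sketch also skips the nine-rule Mamdani stage and feeds the stage-one score directly into the sigmoid.

```python
import numpy as np


def anomaly_score(x):
    """Graded anomaly score in [0, 1] (assumed form, not from the paper)."""
    x = np.asarray(x, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    # 1.4826 rescales the MAD to match the standard deviation under normality.
    scale = 1.4826 * mad if mad > 0 else 1.0
    robust_z = np.abs(x - median) / scale
    # Hypothetical ramp: score 0 below |z| = 2, score 1 above |z| = 4.
    return np.clip((robust_z - 2.0) / 2.0, 0.0, 1.0)


def graded_correct(x, midpoint=0.5, steepness=10.0):
    """Blend each value toward an estimate, weighted by a sigmoid of its score."""
    x = np.asarray(x, dtype=float)
    score = anomaly_score(x)
    # Sigmoid weight: near 0 for low scores, near 1 for high scores.
    weight = 1.0 / (1.0 + np.exp(-steepness * (score - midpoint)))
    # Stand-in replacement estimate; the paper's imputation step may differ.
    estimate = np.median(x)
    corrected = (1.0 - weight) * x + weight * estimate
    # One human-readable audit line per observation; no rows are dropped.
    audit = [
        f"row {i}: {old:.2f} -> {new:.2f} (score={s:.2f}, weight={w:.2f})"
        for i, (old, new, s, w) in enumerate(zip(x, corrected, score, weight))
    ]
    return corrected, audit


values = [10, 11, 9, 10, 12, 50]
corrected, audit = graded_correct(values)
print("\n".join(audit))
```

In this sketch every observation is retained and the outlying value is pulled most of the way toward the estimate, while ordinary values are almost unchanged. With the assumed midpoint and steepness, a score of 0 still receives a weight of about 0.007, so clean values shift very slightly; that is a property of the chosen sigmoid, not a claim about GDEDC itself.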

Cite This Study

A. Tekín studied this question.

synapsesocial.com/papers/6a0ea13abe05d6e3efb5fabb https://doi.org/10.3390/app16105072
