High-dimensional mixed data often lack a unified semantic representation for continuous and discrete attributes, which hinders mixed-attribute similarity modeling and can result in unstable reducts and overfitting in existing neighborhood rough set (NRS) methods. To address this issue, we propose IF-EMD-SPA, an attribute reduction method for NRS grounded in Information Flow theory. Unlike conventional NRS methods that rely on discretization or a single reduction criterion, IF-EMD-SPA first establishes a unified representation framework for heterogeneous attributes based on classifications and an Information Channel Core. It then integrates Earth Mover’s Distance (EMD) and Set Pair Analysis (SPA) to define a similarity metric for mixed attributes. In addition, a three-stage greedy reduction strategy is designed under the dual constraints of dependency preservation and structural error, consisting of dependency-driven forward selection, similarity-driven structure completion, and backward redundancy removal. Experiments on five UCI benchmark datasets and two high-dimensional gene expression datasets show that IF-EMD-SPA achieves average accuracies of 93.5% (k-Nearest Neighbors, KNN), 93.9% (Support Vector Machine, SVM), and 90.8% (Classification and Regression Trees, CART), with SVM achieving the best results on all seven datasets. Under CART, it reaches 100% accuracy on Wine and WPBC, improving performance by up to 37.5 percentage points over comparison methods.
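The dependency-driven forward selection and backward redundancy removal stages can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions. Plain Euclidean distance stands in for the EMD-SPA similarity metric, which the abstract does not define. The similarity-driven structure completion stage is omitted for the same reason. The names `positive_region`, `dependency`, and `greedy_reduct`, the neighborhood radius `delta`, and the tolerance `tol` are hypothetical.

```python
import numpy as np

def positive_region(X, y, attrs, delta):
    """Indices of samples whose delta-neighborhood on `attrs` holds one class only."""
    if not attrs:
        return np.array([], dtype=int)
    sub = X[:, attrs]
    # Pairwise Euclidean distances (assumption: stand-in for the EMD-SPA metric).
    dist = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=2)
    neighbors = dist <= delta
    same_label = y[:, None] == y[None, :]
    # A sample is consistent if every neighbor shares its label.
    consistent = np.all(~neighbors | same_label, axis=1)
    return np.flatnonzero(consistent)

def dependency(X, y, attrs, delta):
    """Neighborhood dependency: fraction of samples in the positive region."""
    return len(positive_region(X, y, attrs, delta)) / len(y)

def greedy_reduct(X, y, delta=0.2, tol=1e-9):
    """Forward selection by dependency gain, then backward redundancy removal."""
    all_attrs = list(range(X.shape[1]))
    target = dependency(X, y, all_attrs, delta)
    selected, current = [], 0.0

    # Stage 1: add the attribute with the largest dependency until the target is met.
    while current < target - tol:
        candidates = [a for a in all_attrs if a not in selected]
        if not candidates:
            break
        scores = [dependency(X, y, selected + [a], delta) for a in candidates]
        best = int(np.argmax(scores))
        if scores[best] <= current + tol:
            break  # no single attribute improves dependency further
        selected.append(candidates[best])
        current = scores[best]

    # Stage 3: drop any attribute whose removal leaves dependency unchanged.
    for a in list(selected):
        rest = [b for b in selected if b != a]
        if rest and dependency(X, y, rest, delta) >= current - tol:
            selected = rest

    return selected, current
```

Calling `greedy_reduct(X, y, delta=0.2)` on a numeric array `X` with attributes scaled to [0, 1] returns the selected attribute indices and the dependency they achieve. Mixed continuous and discrete attributes would need the unified representation described above before a distance can be computed.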
Zhang et al. studied this question.