January 1, 2024 | Open Access

Under Sampling Techniques for Handling Unbalanced Data with Various Imbalance Rates: A Comparative Study

Abstract

Unbalanced data sets are data sets that contain an unequal number of examples for different classes. Such data sets pose a problem for machine learning tools: in data sets with high imbalance ratios, the false negative rate increases because most classifiers are biased toward the majority class. Choosing the most informative evaluation metrics and applying sampling techniques are common ways to handle this problem. In this paper, a comparative analysis of four of the most common under-sampling techniques is conducted over data sets with imbalance rates (IR) ranging from low to medium to high. A Decision Tree classifier and twelve imbalanced data sets with various IR are used to evaluate the effect of each technique in terms of recall, F1-measure, G-mean, recall for the minority class, and F1-measure for the minority class. Results demonstrate that Cluster Centroids outperformed Neighborhood Cleaning Rule (NCL) in terms of recall on all low-IR data sets. For both medium- and high-IR data sets, NCL and Random Under Sampling (RUS) outperformed the other techniques, while Tomek Links had the worst effect.
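To make the quantities named in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes IR is the majority-class count divided by the minority-class count and that G-mean is the geometric mean of the per-class recalls; the abstract gives neither definition, so both are assumptions. All function names are hypothetical, and only Random Under Sampling is shown because the other three techniques need a distance-based neighbor search.

```python
import math
import random
from collections import Counter


def imbalance_ratio(labels):
    """Assumed definition: majority-class count / minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_under_sample(rows, labels, seed=0):
    """Randomly keep as many examples of each class as the smallest class has."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        for row in rng.sample(group, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def recall_per_class(y_true, y_pred):
    """Recall of each class: correctly predicted members / all true members."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {cls: hits[cls] / totals[cls] for cls in totals}


def g_mean(y_true, y_pred):
    """Assumed definition: geometric mean of the per-class recalls."""
    recalls = list(recall_per_class(y_true, y_pred).values())
    return math.prod(recalls) ** (1 / len(recalls))


def f1_for_class(y_true, y_pred, cls):
    """F1-measure treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: 90 majority examples (label 0) and 10 minority examples (label 1).
labels = [0] * 90 + [1] * 10
rows = list(range(100))
print(imbalance_ratio(labels))  # 9.0

_, balanced_labels = random_under_sample(rows, labels)
print(imbalance_ratio(balanced_labels))  # 1.0

# A classifier that always predicts the majority class misses every minority example.
always_majority = [0] * 100
print(g_mean(labels, always_majority))  # 0.0
```

The last line shows why metrics beyond overall accuracy matter here: the always-majority predictor is right on 90% of the toy examples, yet its minority-class recall and G-mean are both zero.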

Cite This Study

Elsoud et al. (2024) studied this question.

synapsesocial.com/papers/6a0c41dfe28175e95a233fa3 https://doi.org/10.14569/ijacsa.2024.01508124