Abstract: Real-world data are often unevenly distributed, which makes it difficult to extract meaningful insights. In binary classification problems, the instances of one class often far outnumber those of the other, causing existing classification models to struggle. Class imbalance problems have been actively discussed for decades, with most efforts focused on improving general classification performance. However, with rapid advances in data acquisition and processing technologies and the arrival of the big data era, the practical usability of these achievements has come into question. Undersampling, a well-known data-level solution to the class imbalance problem, faces these challenges particularly in the context of highly imbalanced big data. Many existing undersampling methods become inefficient in this setting, either because of high computational cost or because they retain an excessive portion of the majority class. This paper introduces two novel undersampling methods, particle stacking undersampling (PSU)-m and PSU-mm, which are designed to represent the majority class more efficiently using a substantially reduced amount of data. These methods achieve sub-quadratic complexity under general conditions and quasi-linear complexity when the size of the majority class exceeds both the minority class size and the feature dimensionality. To evaluate the effectiveness of the proposed methods, benchmark comparisons were conducted against 14 well-known undersampling methods. The evaluation considered three key practical aspects: classification performance, processing time, and the proportion of data retained after undersampling, referred to as the resulting ratio in this study. Numerical results show that the proposed methods achieved the most balanced performance across these metrics.
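To make the resulting ratio concrete, the following minimal sketch applies plain random undersampling, a generic baseline and not the PSU-m or PSU-mm methods, to a synthetic imbalanced dataset. The function names `random_undersample` and `resulting_ratio` are hypothetical, and the sketch assumes the ratio is measured as retained instances over the original dataset size; the paper may define the denominator differently.

```python
import random


def random_undersample(majority, minority, seed=0):
    """Baseline random undersampling (NOT PSU): keep as many majority
    instances as there are minority instances, chosen uniformly at random."""
    rng = random.Random(seed)
    k = min(len(majority), len(minority))
    return rng.sample(majority, k)


def resulting_ratio(n_before, n_after):
    """Assumed definition: fraction of the original data that is retained."""
    return n_after / n_before


# Synthetic, highly imbalanced data: 10,000 majority vs. 100 minority rows.
majority = [(float(i), 0) for i in range(10_000)]
minority = [(float(i), 1) for i in range(100)]

kept = random_undersample(majority, minority)
ratio = resulting_ratio(
    len(majority) + len(minority),  # size before undersampling
    len(kept) + len(minority),      # size after undersampling
)
print(f"retained {len(kept)} majority rows, resulting ratio = {ratio:.4f}")
```

A smaller resulting ratio means less data is passed to the classifier, which is one of the three aspects the evaluation weighs against classification performance and processing time.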
Yongseok Jeon (Wed,) studied this question.