Cluster validity indices (CVIs) are a key tool in machine learning for determining the optimal number of clusters. Traditional CVIs, however, often perform poorly on the complex characteristics common in real-world data, such as inter-cluster overlap, outliers and uneven density distribution. To address this challenge, this paper proposes a multiplicative, adaptive and robust cluster validity index, the Robust Adaptive (RA) index. The index is built on the kernel density function of the sample points and reconstructs its two core components: for intra-cluster compactness, it incorporates density quantiles, which markedly improves robustness against outliers; for inter-cluster separability, it uses a density-based Jeffrey divergence to characterize inter-cluster differences in overlapping datasets. To mitigate the impact of bandwidth selection on kernel density estimation, the study adopts strategies including Scott’s and Silverman’s heuristic rules, allowing the index to adapt to the inherent distribution of the data. Experiments were conducted on both synthetic and real-world datasets. The results show that, compared with classical indices (CH, DB, SIL, I) that perform well on overlapping datasets, the RA index performs better under mild to moderate overlap, uneven density distribution and the presence of outliers. On nine synthetic datasets, the RA index correctly identified the optimal number of clusters in eight cases, a success rate of 88.89%, outperforming all the comparative indices.
On eight real-world datasets of varying scale, dimensionality and structure, the RA index was also the most robust and effective of the five indices compared. Its failure on complex datasets such as S-set4 and Iris, which contain both severe inter-cluster overlap and outliers, indicates that density-based CVIs have inherent limitations on data with high overlap and faint cluster boundaries. This finding suggests a direction for future research: constructing novel CVIs from the perspective of sparse matrices may be a feasible way to address these limitations.
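The building blocks named in the abstract (kernel density estimation with a Scott or Silverman bandwidth, a density quantile, and the Jeffrey divergence between two cluster densities) can be illustrated with a minimal one-dimensional sketch. This is not the RA index itself: the abstract does not give its formula, so the function names, the way the quantile is used as a low-density threshold, and the numerical integration grid are all assumptions made for illustration. The bandwidth factors follow the convention used by SciPy's `gaussian_kde`, and the divergence uses the standard definition J(p, q) = ∫ (p − q) log(p / q) dx.

```python
import math
import random
import statistics


def bandwidth(samples, rule="scott"):
    """1-D rule-of-thumb bandwidth (same factors as SciPy's gaussian_kde)."""
    n = len(samples)
    sigma = statistics.stdev(samples)
    if rule == "scott":
        factor = n ** (-1 / 5)
    elif rule == "silverman":
        factor = (3 * n / 4) ** (-1 / 5)
    else:
        raise ValueError(f"unknown rule: {rule}")
    return sigma * factor


def kde(samples, rule="scott"):
    """Return a Gaussian kernel density estimate as a callable."""
    h = bandwidth(samples, rule)
    norm = 1.0 / (len(samples) * h * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)

    return density


def jeffreys_divergence(p, q, lo, hi, steps=400, eps=1e-12):
    """Symmetrised KL divergence, integrated with the midpoint rule.

    eps is an illustrative guard against log(0) in empty regions.
    """
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        px, qx = p(x) + eps, q(x) + eps
        total += (px - qx) * math.log(px / qx) * dx
    return total


def low_density_threshold(samples, density, q=0.1):
    """q-quantile of the densities at the sample points (hypothetical helper).

    Points whose density falls below this value can be treated as
    candidate outliers when measuring compactness.
    """
    values = sorted(density(s) for s in samples)
    return values[int(q * (len(values) - 1))]


if __name__ == "__main__":
    random.seed(0)
    a = [random.gauss(0.0, 1.0) for _ in range(200)]  # cluster A
    b = [random.gauss(1.0, 1.0) for _ in range(200)]  # overlaps A
    c = [random.gauss(6.0, 1.0) for _ in range(200)]  # well separated from A
    pa, pb, pc = kde(a), kde(b), kde(c, rule="silverman")
    print("J(A, B):", jeffreys_divergence(pa, pb, -5.0, 11.0))
    print("J(A, C):", jeffreys_divergence(pa, pc, -5.0, 11.0))
    print("10% density threshold for A:", low_density_threshold(a, pa))
```

In this sketch the divergence between the overlapping pair comes out smaller than between the separated pair, which is the behaviour a separability term needs. A full index would still have to combine such a term with a compactness term and handle multivariate data, neither of which is attempted here.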
Yan et al. (Thu,) studied this question.