How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis

Abstract

Thanks go to Simon Byers for providing the NNclean denoising procedure.

We consider the problem of determining the structure of clustered data, without prior knowledge of the number of clusters or any other information about their composition. Data are represented by a mixture model in which each component corresponds to a different cluster. Models with varying geometric properties are obtained through Gaussian components with different parameterizations and cross-cluster constraints. Noise and outliers can be modeled by adding a Poisson process component. Partitions are determined by the EM (expectation-maximization) algorithm for maximum likelihood, with initial values from agglomerative hierarchical clustering. Models are compared using an approximation to the Bayes factor based on the Bayesian Information Criterion (BIC); unlike significance tests, this allows comparison of more than two models at the same time, and removes the restriction that the models compared be nested. The problems of determining the number of clusters and the clustering method are solved simultaneously by choosing the best model. Moreover, the EM result provides a
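The model-selection idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' software, and it rests on several assumptions: it uses scikit-learn's `GaussianMixture`, whose four `covariance_type` options stand in for only some of the Gaussian parameterizations mentioned; it relies on scikit-learn's default k-means initialization rather than agglomerative hierarchical clustering; it omits the Poisson noise component; and the synthetic data and the helper name `select_model` are hypothetical. scikit-learn defines `bic` so that lower values are better.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def select_model(X, max_components=6,
                 covariance_types=("spherical", "diag", "tied", "full"),
                 seed=0):
    """Fit a Gaussian mixture by EM for every (covariance type, k) pair
    and return the pair with the lowest BIC (scikit-learn's convention)."""
    results = []
    for cov in covariance_types:
        for k in range(1, max_components + 1):
            gm = GaussianMixture(n_components=k, covariance_type=cov,
                                 n_init=3, random_state=seed).fit(X)
            results.append((gm.bic(X), cov, k, gm))
    # Choosing the best-scoring model settles the number of clusters
    # and the covariance structure in a single step.
    _, best_cov, best_k, best_gm = min(results, key=lambda r: r[0])
    return best_cov, best_k, best_gm


# Hypothetical synthetic data: three well-separated groups in the plane.
rng = np.random.default_rng(0)
centers = ([0.0, 0.0], [6.0, 0.0], [0.0, 6.0])
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(150, 2)) for c in centers])

best_cov, best_k, best_gm = select_model(X)
labels = best_gm.predict(X)  # partition implied by the selected model
print(best_cov, best_k)
```

Because each candidate is scored on the same scale, models that are not nested (for example, a "tied" model with four components and a "full" model with two) can be compared directly.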
