Clustering is a powerful technique for large-scale topic discovery from text. It involves two phases: first, feature extraction maps each document or record to a point in high-dimensional space; then, clustering algorithms automatically group the points into a hierarchy of clusters. We describe an unsupervised, near-linear time text clustering system that offers a number of algorithm choices for each phase. We introduce a methodology for measuring the quality of a cluster hierarchy in terms of F-Measure, and present the results of experiments comparing different algorithms. The evaluation considers some feature selection parameters (tf-idf and feature vector length) but focuses on the clustering algorithms, namely techniques from Scatter/Gather (buckshot, fractionation, and split/join) and k-means. Our experiments suggest that continuous center adjustment contributes more to cluster quality than seed selection does. It follows that using a simpler seed selection algorithm gives a better time/quality tradeoff. We describe a refinement to center adjustment, "vector average damping," that further improves cluster quality. We also compare the near-linear time algorithms to a group average greedy agglomerative clustering algorithm to demonstrate the time/quality tradeoff quantitatively.
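The abstract does not spell out how F-Measure is applied to a cluster hierarchy, so the following is only a minimal sketch of one common formulation, not necessarily the authors' exact definition. It assumes that every node of the hierarchy is given as a set of document ids, that each document has a single known class label, and that each class is scored by its best-matching cluster anywhere in the hierarchy, weighted by class size. The function names `f_measure` and `hierarchy_f_measure` are hypothetical.

```python
from collections import Counter


def f_measure(n_ij, n_i, n_j):
    """Harmonic mean of precision and recall for one (class, cluster) pair.

    n_ij: documents of the class that fall in the cluster
    n_i:  size of the class
    n_j:  size of the cluster
    """
    if n_ij == 0:
        return 0.0
    precision = n_ij / n_j
    recall = n_ij / n_i
    return 2 * precision * recall / (precision + recall)


def hierarchy_f_measure(labels, clusters):
    """Score a cluster hierarchy against known class labels.

    labels:   dict mapping document id -> class label (assumed input shape)
    clusters: iterable of sets of document ids, one per hierarchy node
    """
    class_sizes = Counter(labels.values())
    total = len(labels)
    score = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for members in clusters:
            n_ij = sum(1 for doc in members if labels[doc] == cls)
            best = max(best, f_measure(n_ij, n_i, len(members)))
        # Weight each class by its share of the collection.
        score += (n_i / total) * best
    return score


labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
perfect = [{0, 1, 2, 3, 4, 5}, {0, 1, 2}, {3, 4, 5}]
mixed = [{0, 1, 2, 3, 4, 5}, {0, 1, 3}, {2, 4, 5}]
perfect_score = hierarchy_f_measure(labels, perfect)
mixed_score = hierarchy_f_measure(labels, mixed)
```

Because the maximum is taken over every node, a hierarchy is rewarded when some cluster at any depth matches a class well, which is why the measure suits hierarchical output rather than a single flat partition.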
Bjornar Larsen
Chinatsu Aone
SRA International (United States)
SRA International (United States)
Larsen et al. (Sun,) studied this question.
synapsesocial.com/papers/6a0f8b6cd8c5cf602efcd812 — DOI: https://doi.org/10.1145/312129.312186