August 1, 1999Open Access

Fast and effective text mining using linear-time document clustering

Key Points

Key points are not available for this paper at this time.

Abstract

Clustering is a powerful technique for large-scale topic discovery from text. It involves two phases: first, feature extraction maps each document or record to a point in high-dimensional space, then clustering algorithms automatically group the points into a hierarchy of clusters. We describe an unsupervised, near-linear time text clustering system that offers a number of algorithm choices for each phase. We introduce a methodology for measuring the quality of a cluster hierarchy in terms of F-Measure, and present the results of experiments comparing different algorithms. The evaluation considers some feature selection parameters (tfidfand feature vector length) but focuses on the clustering algorithms, namely techniques from Scatter/Gather (buckshot, fractionation, and split/join) and kmeans. Our experiments suggest that continuous center adjustment contributes more to cluster quality than seed selection does. It follows that using a simpler seed selection algorithm gives a better time/quality tradeoff. We describe a refinement to center adjustment, "vector average damping," that further improves cluster quality. We also compare the near-linear time algorithms to a group average greedy agglomerative clustering algorithm to demonstrate the time/quality tradeoff quantitatively.

Connected Papers

Building similarity graph...

Analyzing shared references across papers

Discussion

Authors

Bjornar Larsen

Chinatsu Aone

SRA International (United States)

Actions

Institutions

SRA International (United States)

References and Citations

Connected Papers

Building similarity graph...

Analyzing shared references across papers

Fast and effective text mining using linear-time document clustering

Key Points

Abstract

Citation Network

Connected Papers

Discussion

Authors

Actions

Institutions

References and Citations

Citation Network

Connected Papers

Discussion

Cite this study