August 1, 1998

Distributional clustering of words for text classification

Abstract

This paper describes the application of Distributional Clustering [20] to document classification. This approach clusters words into groups based on the distribution of class labels associated with each word. Thus, unlike some other unsupervised dimensionality-reduction techniques, such as Latent Semantic Indexing, we are able to compress the feature space much more aggressively while still maintaining high document classification accuracy. Experimental results obtained on three real-world data sets show that we can reduce the feature dimensionality by three orders of magnitude and lose only 2% accuracy, significantly better than Latent Semantic Indexing [6], class-based clustering [1], feature selection by mutual information [23], or Markov-blanket-based feature selection [13]. We also show that less aggressive clustering sometimes results in improved classification accuracy over classification without clustering.

1 Introduction

The popularity of the Internet has caused an exponent...
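
The clustering idea stated in the abstract, grouping words by the distribution of class labels associated with each word, can be illustrated with a small sketch. The abstract does not specify the similarity measure or the merge procedure, so everything below is an assumption made for illustration: the function names are hypothetical, words are compared with a count-weighted Jensen-Shannon divergence, and clusters are merged greedily until a target number remains. It should not be read as the paper's actual algorithm.

```python
import math


def class_distribution(counts):
    """Normalize per-class occurrence counts into an estimate of P(class | word)."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL divergence D(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def weighted_js_divergence(p, q, weight_p, weight_q):
    """Jensen-Shannon divergence with mixture weights proportional to the counts.

    Assumption: this measure is chosen for illustration only.
    """
    total = weight_p + weight_q
    a, b = weight_p / total, weight_q / total
    mean = [a * pi + b * qi for pi, qi in zip(p, q)]
    return a * kl_divergence(p, mean) + b * kl_divergence(q, mean)


def cluster_words(word_class_counts, num_clusters):
    """Greedily merge the two closest clusters until num_clusters remain.

    word_class_counts maps each word to a list of counts, one per class,
    where every word has at least one nonzero count.
    """
    clusters = [([word], list(counts)) for word, counts in word_class_counts.items()]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci, cj = clusters[i][1], clusters[j][1]
                d = weighted_js_divergence(
                    class_distribution(ci), class_distribution(cj), sum(ci), sum(cj)
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # A merged cluster pools the class counts of its member words.
        merged_words = clusters[i][0] + clusters[j][0]
        merged_counts = [x + y for x, y in zip(clusters[i][1], clusters[j][1])]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((merged_words, merged_counts))
    return [sorted(words) for words, _ in clusters]


# Toy data (made up): counts of each word in two classes, e.g. sports and politics.
example_counts = {
    "goal": [9, 1],
    "team": [8, 2],
    "senate": [1, 9],
    "vote": [2, 8],
}
print(cluster_words(example_counts, 2))
```

Each resulting cluster would then serve as a single feature in place of its member words, which is how the feature space shrinks. The exhaustive pairwise search is cubic in the vocabulary size and only suitable for toy inputs.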
