January 25, 2013

A Similarity Measure for Text Classification and Clustering

Abstract

Measuring the similarity between documents is an important operation in the text processing field. In this paper, a new similarity measure is proposed. To compute the similarity between two documents with respect to a feature, the proposed measure takes the following three cases into account: a) the feature appears in both documents, b) the feature appears in only one document, and c) the feature appears in neither document. In the first case, the similarity increases as the difference between the two feature values decreases, and the contribution of the difference is normally scaled. In the second case, a fixed value is contributed to the similarity. In the last case, the feature makes no contribution to the similarity. The proposed measure is extended to gauge the similarity between two sets of documents. The effectiveness of the measure is evaluated on several real-world data sets for text classification and clustering problems. The results show that the proposed measure performs better than the other measures tested.
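The three cases in the abstract can be illustrated with a short Python sketch. This is not the paper's definition: the abstract gives no formula, so the Gaussian-style scaling, the per-feature spread `sigma`, the fixed `penalty` for a feature found in only one document, the final rescaling to [0, 1], and the function names are all assumptions made here for illustration.

```python
import math


def feature_contribution(x, y, sigma=1.0, penalty=1.0):
    """Return (score, counted) for one feature of two documents.

    x and y are the feature's values in each document (0 means absent).
    sigma and penalty are hypothetical parameters, not taken from the abstract.
    """
    if x > 0 and y > 0:
        # Case a: present in both documents. The score rises toward 1
        # as the two values get closer (assumed Gaussian-style scaling).
        return 0.5 * (1.0 + math.exp(-(((x - y) / sigma) ** 2))), 1
    if x > 0 or y > 0:
        # Case b: present in only one document. A fixed value is contributed.
        return -penalty, 1
    # Case c: absent from both documents. No contribution at all.
    return 0.0, 0


def similarity(d1, d2, sigmas=None, penalty=1.0):
    """Combine per-feature contributions into one score in [0, 1]."""
    if len(d1) != len(d2):
        raise ValueError("documents must share the same feature space")
    if sigmas is None:
        sigmas = [1.0] * len(d1)
    total, counted = 0.0, 0
    for x, y, sigma in zip(d1, d2, sigmas):
        score, used = feature_contribution(x, y, sigma, penalty)
        total += score
        counted += used
    if counted == 0:
        # Assumption: two documents with no features at all score 0.
        return 0.0
    raw = total / counted  # lies in [-penalty, 1]
    return (raw + penalty) / (1.0 + penalty)  # rescaled to [0, 1]


if __name__ == "__main__":
    print(similarity([1.0, 0.0, 2.0], [1.0, 0.0, 2.0]))  # identical documents
    print(similarity([1.0, 0.0], [0.0, 1.0]))  # no shared features
```

Under these assumptions, features absent from both documents are left out of the average entirely, so adding unused vocabulary does not change the score, while each feature found in only one document pulls the score down by a fixed amount.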

Authors

Yung-Shen Lin

Jung-Yi Jiang

National Cheng Kung University

Shie-Jue Lee

Journals

IEEE Transactions on Knowledge and Data Engineering

Institutions

National Sun Yat-sen University
