January 1, 2016 | Open Access

Effects of Creativity and Cluster Tightness on Short Text Clustering Performance

Abstract

Properties of corpora, such as the diversity of vocabulary and how tightly related texts cluster together, impact the best way to cluster short texts. We examine several such properties in a variety of corpora and track their effects on various combinations of similarity metrics and clustering algorithms. We show that semantic similarity metrics outperform traditional n-gram and dependency similarity metrics for k-means clustering of a linguistically creative dataset, but do not help with less creative texts. Yet the choice of similarity metric interacts with the choice of clustering method. We find that graph-based clustering methods perform well on tightly clustered data but poorly on loosely clustered data. Semantic similarity metrics generate loosely clustered output even when applied to a tightly clustered dataset. Thus, the best performing clustering systems could not use semantic metrics.
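
To make the idea of cluster tightness under an n-gram similarity metric concrete, here is a minimal sketch using only the Python standard library. It is not the authors' method: the Jaccard overlap of word n-grams, the `tightness` score (mean within-cluster similarity minus mean between-cluster similarity), and the toy texts are all assumptions chosen for illustration.

```python
from itertools import combinations


def word_ngrams(text, n=1):
    """Return the set of word n-grams of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b, n=1):
    """Jaccard overlap of two texts' n-gram sets (0.0 if both sets are empty)."""
    set_a, set_b = word_ngrams(a, n), word_ngrams(b, n)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def tightness(clusters, n=1):
    """Hypothetical tightness score: mean within-cluster similarity
    minus mean between-cluster similarity, over all text pairs."""
    labeled = [(text, label) for label, texts in clusters.items() for text in texts]
    within, between = [], []
    for (t1, l1), (t2, l2) in combinations(labeled, 2):
        (within if l1 == l2 else between).append(jaccard(t1, t2, n))

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(within) - mean(between)


# Toy gold clusters (invented data): same topics, different vocabulary diversity.
tight = {
    "password": ["reset my password", "reset my password please"],
    "order": ["cancel my order", "cancel my order now"],
}
loose = {
    "password": ["reset my password", "forgot login details"],
    "order": ["cancel my order", "stop the shipment"],
}

print(tightness(tight), tightness(loose))
```

In this toy setup the repetitive corpus scores higher than the lexically varied one, because texts in the varied corpus share almost no surface n-grams with their cluster mates. That is the kind of situation in which a surface-overlap metric has little signal to work with.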

Cite this study

Finegan‐Dollak et al. (2016) studied this question.

synapsesocial.com/papers/6a1f85b4b24abb7dd47ed2a6 — DOI: https://doi.org/10.18653/v1/p16-1062

Authors

Catherine Finegan‐Dollak

University of Richmond

Reed Coke

Rui Zhang

University of Science and Technology of China

Institutions

University of Michigan
