August 22, 2004

A cross-collection mixture model for comparative text mining

Abstract

In this paper, we define and study a novel text mining problem, which we refer to as comparative text mining. Given a set of comparable text collections, the task of comparative text mining is to discover any latent common themes across all collections as well as summarize the similarity and differences of these collections along each common theme. This general problem subsumes many interesting applications, including business intelligence, summarizing reviews of similar products, and comparing different opinions about a common topic. We propose a generative probabilistic mixture model for comparative text mining. The model simultaneously performs cross-collection clustering and within-collection clustering, and can be applied to an arbitrary set of comparable text collections. The model can be estimated efficiently using the Expectation-Maximization (EM) algorithm. We evaluate the model on two different text data sets (i.e., a news article data set and a laptop review data set), and compare it with a baseline clustering method also based on a mixture model. Experiment results show that the model is quite effective in discovering the latent common themes across collections and performs significantly better than our baseline mixture model.
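
As a rough illustration of how EM estimates a mixture model over text, the sketch below fits a plain mixture of unigram language models to a handful of tokenized documents. It is not the cross-collection model proposed in the paper, and it is not necessarily the paper's baseline either; the abstract does not give the model equations, so this is a generic textbook mixture. The function name `em_mixture_of_unigrams`, its parameters, and the smoothing constants are all assumptions made for this example.

```python
import math
import random


def em_mixture_of_unigrams(docs, k, iters=50, seed=0, smooth=0.01):
    """Fit a k-theme mixture of unigram models with EM (illustrative only).

    docs: list of token lists. Returns (priors, word_dists, resp), where
    resp[d][j] is the posterior probability that document d belongs to theme j.
    """
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})

    # Start from random soft assignments of documents to themes.
    resp = []
    for _ in docs:
        row = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(row)
        resp.append([r / total for r in row])

    for _ in range(iters):
        # M-step: re-estimate theme priors (lightly smoothed so that no
        # prior becomes exactly zero) and per-theme word distributions.
        eps = 1e-6
        priors = [
            (sum(r[j] for r in resp) + eps) / (len(docs) + k * eps)
            for j in range(k)
        ]
        word_dists = []
        for j in range(k):
            counts = {w: smooth for w in vocab}  # additive smoothing
            for d, r in zip(docs, resp):
                for w in d:
                    counts[w] += r[j]
            total = sum(counts.values())
            word_dists.append({w: c / total for w, c in counts.items()})

        # E-step: posterior over themes for each document, computed in
        # log space to avoid underflow on long documents.
        resp = []
        for d in docs:
            logp = [
                math.log(priors[j]) + sum(math.log(word_dists[j][w]) for w in d)
                for j in range(k)
            ]
            m = max(logp)
            unnorm = [math.exp(lp - m) for lp in logp]
            z = sum(unnorm)
            resp.append([u / z for u in unnorm])

    return priors, word_dists, resp
```

Calling `em_mixture_of_unigrams(docs, k=2)` on a small list of tokenized reviews or news articles returns the theme priors, one word distribution per theme, and a soft theme assignment for each document. A cross-collection variant would additionally need collection-specific word distributions for each theme, which this sketch deliberately omits.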
