This report presents a technical and economic analysis of unsupervised semantic clustering as a cost-effective alternative to supervised machine learning for extracting insights from unstructured customer feedback. The central argument is that manual data labelling, the foundation of supervised models, is costly and slow enough to create a significant bottleneck for businesses that rely on Voice of Customer (VoC) data. The economic analysis shows that labelling a moderately sized dataset can cost tens of thousands of dollars and take months, by which time the insights may be outdated. This "labelling tax" is not a one-time cost: concept drift forces periodic relabelling, so the expense recurs. As an alternative, the report evaluates an unsupervised pipeline that transforms raw text into geometric representations to uncover latent semantic structure without human labels. The pipeline has three stages. First, text is vectorised with TF-IDF, which up-weights distinctive terms, including sentiment-rich ones. Second, the resulting high-dimensional sparse matrix is compressed with Truncated SVD, which reduces dimensionality and groups related terms such as synonyms into latent concepts. Third, K-Means partitions the reduced vectors into clusters. Empirical validation on the IMDB movie review dataset demonstrates the pipeline's effectiveness: after rigorous preprocessing, including the removal of named entities to avoid topic bias, the algorithm identified two clusters that, when visualised with t-SNE, broadly aligned with the positive and negative sentiment labels. Although the accuracy (~70%) is lower than that of supervised models (>95%), the report argues that this performance is adequate for many strategic uses, such as trend analysis and customer segmentation. The approach's advantages (scalability, domain independence, and the discovery of emergent "unknown unknowns") position unsupervised clustering as a high-ROI tool for organisations seeking to unlock their text data quickly and affordably.
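The three stages described above can be illustrated with a minimal sketch. It assumes scikit-learn is available and is not taken from the report's own code. The toy reviews, their labels, the helper names `cluster_reviews` and `two_cluster_accuracy`, and all parameter values are hypothetical. Named-entity removal and the t-SNE visualisation are omitted for brevity.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; the report's experiments use the IMDB review dataset.
TOY_REVIEWS = [
    "A wonderful, moving film with brilliant acting",
    "Brilliant direction and a wonderful, touching story",
    "Loved it, wonderful cast and brilliant script",
    "A terrible, boring film with awful acting",
    "Awful direction and a boring, terrible story",
    "Hated it, terrible cast and awful script",
]
# Illustrative labels (1 = positive, 0 = negative), used only for evaluation.
TOY_SENTIMENT = [1, 1, 1, 0, 0, 0]


def cluster_reviews(texts, n_components=2, n_clusters=2, seed=0):
    """TF-IDF -> Truncated SVD -> K-Means; returns one cluster id per text."""
    # Stage 1: sparse TF-IDF matrix (rows = documents, columns = terms).
    tfidf = TfidfVectorizer(stop_words="english")
    sparse_matrix = tfidf.fit_transform(texts)

    # Stage 2: compress the sparse matrix into a few dense latent dimensions.
    svd = TruncatedSVD(n_components=n_components, random_state=seed)
    dense = svd.fit_transform(sparse_matrix)

    # Stage 3: partition the reduced vectors; no labels are used here.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return kmeans.fit_predict(dense).tolist()


def two_cluster_accuracy(cluster_ids, true_labels):
    """Agreement between two clusters and binary labels, up to relabelling.

    Cluster ids are arbitrary, so the score is the better of the two
    possible cluster-to-label assignments.
    """
    matches = sum(c == t for c, t in zip(cluster_ids, true_labels))
    direct = matches / len(true_labels)
    return max(direct, 1.0 - direct)


if __name__ == "__main__":
    ids = cluster_reviews(TOY_REVIEWS)
    print("cluster ids:", ids)
    print("agreement with labels:", two_cluster_accuracy(ids, TOY_SENTIMENT))
```

The labels enter only in `two_cluster_accuracy`, after clustering is complete, which mirrors the abstract's point that labels are needed for validation but not for discovering the clusters. On a real corpus the number of SVD components would typically be far larger than the two used for this toy example.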
Partha Majumdar studied this question.