March 3, 2026 · Open Access

Unsupervised Semantic Discovery in Customer Feedback: A Technical and Economic Analysis of Label-Free Clustering

Key Points

Unsupervised clustering effectively uncovers semantic structures in customer feedback, reducing reliance on costly manual labelling.
TF-IDF vectorisation followed by Truncated SVD dimensionality reduction turns unstructured text into compact representations in which related terms group into latent concepts.
Empirical validation on the IMDB movie review dataset produced two clusters that aligned with positive and negative sentiment at around 70% accuracy.
Implementing this approach alleviates the recurring financial burden of data labelling, especially amid ongoing concept drift.
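
Because cluster IDs are arbitrary, an accuracy figure for a clustering only makes sense once each cluster has been mapped to a label. The sketch below shows one common way to do that, by assigning each cluster its majority label (a purity-style score). It is an assumption that the report scored its clusters this way; the function name `cluster_accuracy` and the ten-item toy data are hypothetical.

```python
from collections import Counter


def cluster_accuracy(cluster_ids, true_labels):
    """Fraction of items whose label matches the majority label of their cluster."""
    counts_by_cluster = {}
    for cluster, label in zip(cluster_ids, true_labels):
        counts_by_cluster.setdefault(cluster, Counter())[label] += 1
    # Each cluster contributes the size of its largest label group.
    correct = sum(c.most_common(1)[0][1] for c in counts_by_cluster.values())
    return correct / len(true_labels)


# Hypothetical toy data: ten reviews, two clusters, held-out sentiment labels.
clusters = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
labels = ["neg", "neg", "neg", "pos", "pos", "pos", "pos", "neg", "pos", "neg"]

score = cluster_accuracy(clusters, labels)
print(score)
```

The held-out labels are used only for scoring here; they play no part in forming the clusters.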

Abstract

This report presents a technical and economic analysis of unsupervised semantic clustering as a cost-effective alternative to supervised machine learning for extracting insights from unstructured customer feedback. The main argument is that the high cost and long delays of manual data labelling, on which supervised models depend, create a significant bottleneck for businesses working with Voice of Customer (VoC) data. An economic analysis shows that labelling a moderately sized dataset can cost tens of thousands of dollars and take months, by which time the insights may be outdated. This "labelling tax" is not a one-time cost: concept drift makes it a recurring expense, compounding the financial burden. As a solution, the analysis evaluates an unsupervised pipeline that transforms raw text into geometric representations to uncover latent semantic structures without human labels. The pipeline has three stages. First, text is vectorised using TF-IDF, which amplifies sentiment-rich terms. Next, the high-dimensional sparse matrix is compressed with Truncated SVD, which reduces dimensionality and groups synonyms into latent concepts. Finally, K-Means partitions the data into clusters. Empirical validation on the IMDB movie review dataset demonstrates the pipeline's effectiveness: after rigorous preprocessing, including the removal of named entities to avoid topic bias, the algorithm identified two clusters that, when visualised with t-SNE, aligned strongly with positive and negative sentiment labels. While the accuracy (~70%) is lower than that of supervised models (>95%), the report argues that this performance is adequate for many strategic uses, such as trend analysis and customer segmentation. The approach's advantages (scalability, domain independence, and the discovery of emergent "unknown unknowns") position unsupervised clustering as a high-ROI tool for organisations seeking to unlock their text data quickly and affordably.
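
The three stages named in the abstract (TF-IDF, Truncated SVD, K-Means) can be sketched with scikit-learn as follows. This is a minimal illustration rather than the report's actual implementation: the six-sentence corpus, the function name `cluster_feedback`, and all parameter values are assumptions chosen for the example, and the named-entity removal and t-SNE visualisation steps are omitted.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in corpus; the report itself used IMDB movie reviews.
reviews = [
    "a wonderful, moving film with brilliant acting",
    "brilliant and wonderful, I loved every minute",
    "loved the acting, a truly wonderful story",
    "terrible plot and awful, boring acting",
    "boring, awful and a terrible waste of time",
    "an awful film, terrible and boring throughout",
]


def cluster_feedback(texts, n_components=2, n_clusters=2, seed=0):
    """Return (reduced vectors, cluster ids) for a list of raw texts."""
    # Stage 1: sparse TF-IDF matrix, one row per document.
    tfidf = TfidfVectorizer(stop_words="english")
    sparse_matrix = tfidf.fit_transform(texts)

    # Stage 2: compress to a few dense latent dimensions.
    # n_components=2 is an assumption sized for this tiny corpus.
    svd = TruncatedSVD(n_components=n_components, random_state=seed)
    reduced = svd.fit_transform(sparse_matrix)

    # Stage 3: partition the reduced vectors with K-Means.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    cluster_ids = kmeans.fit_predict(reduced)
    return reduced, cluster_ids


reduced, cluster_ids = cluster_feedback(reviews)
print(reduced.shape, cluster_ids)
```

No sentiment labels enter the pipeline at any point; the cluster IDs it returns are arbitrary integers that still need to be interpreted, for example by inspecting the highest-weighted terms in each cluster.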
