Short text clustering is challenging because short texts are highly susceptible to noise, which degrades clustering results. Making the model more robust to noise while improving clustering performance is therefore a central problem in this area. To address it, we propose SCHACL, a novel short text clustering model that combines hybrid data augmentation with contrastive learning to improve robustness, particularly on noisy data. SCHACL consists of two key modules: hybrid data augmentation and contrastive clustering. The hybrid data augmentation module applies a mix of strong and weak augmentations to the original data, improving the model's ability to handle noise. In addition, we introduce a new data augmentation technique for clustering tasks, Pos-based Deletion. The contrastive clustering module integrates contrastive learning into the short text clustering process, enabling the model to better separate distinct clusters. We evaluate SCHACL on eight short text datasets, and the experimental results show that it significantly outperforms several baseline models: accuracy improves by 0.72%-4.16% and normalized mutual information increases by 0.27%-2.68%. These results confirm that the proposed approach improves both the model's robustness and its clustering performance.
Zhang et al. (Sun,) studied this question.
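As a rough illustration of the idea behind the contrastive clustering module, the sketch below computes a standard NT-Xent (normalized temperature-scaled cross-entropy) loss between embeddings of weakly and strongly augmented views of the same texts. This is a generic contrastive objective and not necessarily the loss SCHACL uses; the function names, the temperature value, and the toy embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(weak_views, strong_views, temperature=0.5):
    """Mean NT-Xent loss over all 2N views.

    weak_views[i] and strong_views[i] are embeddings of two augmentations
    of the same text (a positive pair); every other view in the batch is
    treated as a negative. The temperature is an assumed hyperparameter.
    """
    if len(weak_views) != len(strong_views):
        raise ValueError("each text needs one weak and one strong view")
    views = list(weak_views) + list(strong_views)
    n = len(weak_views)
    total = 0.0
    for i, anchor in enumerate(views):
        positive = (i + n) % (2 * n)  # index of the other view of the same text
        # Scaled similarities to every view except the anchor itself.
        sims = {
            j: cosine(anchor, views[j]) / temperature
            for j in range(2 * n)
            if j != i
        }
        log_denominator = math.log(sum(math.exp(s) for s in sims.values()))
        total += log_denominator - sims[positive]
    return total / (2 * n)


# Toy 2-D embeddings (hypothetical): two texts, one weak and one strong view each.
weak = [[1.0, 0.0], [0.0, 1.0]]
strong_aligned = [[0.9, 0.1], [0.1, 0.9]]  # views of the same text stay close
strong_swapped = [[0.1, 0.9], [0.9, 0.1]]  # views of the same text drift apart

loss_aligned = nt_xent_loss(weak, strong_aligned)
loss_swapped = nt_xent_loss(weak, strong_swapped)
```

In this toy setup, the loss is lower when the two views of each text have similar embeddings than when they do not. That is the pressure a contrastive objective puts on the encoder to keep noisy variants of a text close together.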