April 27, 2021 · Open Access

Unsupervised document classification integrating web scraping, one-class SVM and LDA topic modelling

Abstract

Unsupervised document classification for imbalanced data sets poses a major challenge. To obtain accurate classification results, training data sets are often created manually, which requires expert knowledge, time and money. Depending on the imbalance of the data set, this approach either requires human labelling of all of the data or fails to adequately recognize underrepresented categories. We propose an integration of web scraping, one-class Support Vector Machines (SVM) and Latent Dirichlet Allocation (LDA) topic modelling as a multi-step classification rule that circumvents manual labelling. Unsupervised one-class document classification is achieved by integrating out-of-domain training data, and more than 80% of the target data is correctly classified. The proposed method is validated on multiple data sets and outperforms common machine learning classifiers.
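
As a rough illustration of the two modelling components named in the abstract, the following minimal sketch uses scikit-learn. It is not the authors' implementation. The example texts, the helper names `one_class_filter` and `topic_weights`, the TF-IDF features, the linear kernel and all parameter values are assumptions made for illustration. Hard-coded strings stand in for scraped, out-of-domain training texts, and the abstract does not say how the SVM and LDA outputs are combined into the final rule, so that step is left open.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import OneClassSVM

# Hypothetical stand-in for web-scraped, out-of-domain texts on the target class.
scraped_docs = [
    "industrial robots automate welding on the factory floor",
    "the robot arm assembles parts in the automated factory",
    "automation and robots increase factory production",
    "factory automation relies on programmable robot arms",
]

# Hypothetical unlabelled target corpus to be classified.
target_docs = [
    "our factory uses robots for automated assembly",
    "the bakery sells fresh bread and cakes every morning",
    "robot welding cells improve production automation",
    "the hotel offers rooms with breakfast and a sea view",
]


def one_class_filter(train_docs, docs, nu=0.1):
    """Fit a one-class SVM on train_docs; return indices of docs predicted as inliers."""
    vectorizer = TfidfVectorizer(stop_words="english")
    x_train = vectorizer.fit_transform(train_docs)
    svm = OneClassSVM(kernel="linear", nu=nu).fit(x_train)
    predictions = svm.predict(vectorizer.transform(docs))  # +1 inlier, -1 outlier
    return [i for i, label in enumerate(predictions) if label == 1]


def topic_weights(docs, n_topics=2, seed=0):
    """Fit LDA on docs; return the document-topic matrix (each row sums to 1)."""
    counts = CountVectorizer(stop_words="english").fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    return lda.fit_transform(counts)


kept = one_class_filter(scraped_docs, target_docs)
weights = topic_weights(target_docs)
print("inlier indices:", kept)
print("document-topic weights:", weights.round(2))
```

On a corpus this small the predictions are not meaningful. The sketch only shows where out-of-domain training data enters (the one-class SVM is fitted on `scraped_docs` alone) and what each component returns.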

Authors

Anton Thielmann

Clausthal University of Technology

Christoph Weisser

Witten/Herdecke University

Astrid Krenz

Journals

Journal of Applied Statistics

Institutions

University of Sussex

University of Göttingen
