Digitized historical newspapers are a promising source for economic, social, and political history. Although many historical newspapers are now available in some digitized form, their quantitative analysis has only sporadically entered the canon of scientific methods and is far from being a standard approach, mostly because of access limitations and data formats unsuitable for digital analysis. We propose to re-process large quantities of digital newspaper images into a more machine-readable, paragraph-segmented form, and to use topic modeling techniques to identify and track topics in newspapers over time and to create topic-specific sub-corpora. This approach serves to identify relevant articles for any number of further research questions within hours, replacing months of browsing through web viewers or copying results from keyword searches. Most topic models are designed for smaller corpora. Since historical newspapers are now available in enormous quantities, the applicability of these models at that scale is open to question. We create a large dataset by re-processing all digitally available pages of the Kölnische Zeitung, comprising 425,194 pages from 1803--1945. We then investigate the application of four topic models (Gensim LDA, tomotopy LDA, LeetTopic, and BERTopic) to this dataset in order to assess their (un)suitability for processing data of this scale. Among these methods, the tomotopy LDA implementation proves the most reliable on large datasets. We show that topic modeling can easily isolate sections of the newspaper that are of particular interest to researchers, such as trade registry entries and various kinds of labor market ads, and can also identify and isolate prominent political topics in the newspaper articles.
Kuebart et al. studied this question.
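The idea of building topic-specific sub-corpora and tracking a topic over time, as described in the abstract, can be illustrated with a minimal sketch. It assumes that some topic model has already assigned each paragraph a distribution over topics. The record layout (`year`, `text`, `topics`), the function names, the 0.5 threshold, and the toy records are all hypothetical and chosen for illustration only; they are not taken from the study or from any of the libraries named above.

```python
from collections import defaultdict

def build_subcorpus(paragraphs, topic_id, min_weight=0.5):
    """Return the paragraphs whose weight for `topic_id` is at least `min_weight`.

    Each paragraph is a dict with a `topics` mapping of topic id -> weight
    (an assumed layout, not the format used in the study).
    """
    return [p for p in paragraphs if p["topics"].get(topic_id, 0.0) >= min_weight]

def topic_share_by_year(paragraphs, topic_id, min_weight=0.5):
    """Fraction of paragraphs per year that are assigned to `topic_id`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for p in paragraphs:
        totals[p["year"]] += 1
        if p["topics"].get(topic_id, 0.0) >= min_weight:
            hits[p["year"]] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}

# Invented toy records; the texts are placeholders, not real newspaper content.
paragraphs = [
    {"year": 1900, "text": "placeholder paragraph A", "topics": {0: 0.8, 1: 0.2}},
    {"year": 1900, "text": "placeholder paragraph B", "topics": {0: 0.1, 1: 0.9}},
    {"year": 1910, "text": "placeholder paragraph C", "topics": {0: 0.7, 1: 0.3}},
    {"year": 1910, "text": "placeholder paragraph D", "topics": {0: 0.6, 1: 0.4}},
]

subcorpus = build_subcorpus(paragraphs, topic_id=0)
shares = topic_share_by_year(paragraphs, topic_id=0)
print(len(subcorpus), shares)
```

In practice the threshold would have to be tuned per topic: a low value yields a broader but noisier sub-corpus, a high value a smaller and cleaner one.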