March 3, 2026 · Open Access

Article based topic modeling on re-processed historical newspapers

Key Points

Topic modeling can isolate newspaper sections of particular research interest, such as trade registry entries and labor market ads, as well as prominent political topics.
A dataset of 425,194 pages of the Kölnische Zeitung, covering 1803 to 1945, was re-processed into a more machine-readable, paragraph-segmented form.
Of the four topic models compared (Gensim LDA, tomotopy LDA, LeetTopic and BERTopic), the tomotopy LDA implementation proved most reliable on a dataset of this scale.
The approach is intended to let researchers identify relevant articles in a matter of hours, rather than spending months browsing web viewers or copying results from keyword searches.

Abstract

Digitized historical newspapers are a promising source for economic, social, and political history. In spite of the fact that many historical newspapers are available today in some digitized form, their quantitative analysis has only sporadically found its way into the canon of scientific methods and is far from becoming a standard approach, mostly due to access limitations and unsuitable data formats for digital analysis. We propose to re-process large quantities of digital images of newspapers into a better machine-readable and paragraph-segmented form and use topic modeling techniques to identify and track topics in newspapers over time and to create topic-specific sub-corpora. This approach will serve to identify relevant articles for any number of further research questions in a mere matter of hours, eliminating months of flicking through web-viewers or copying results from keyword searches. Most topic models are designed for smaller corpora. Since historical newspapers are now available in enormous quantities, their applicability stands to be questioned. We create a large dataset by re-processing all digitally available pages of the Kölnische Zeitung, consisting of 425,194 pages from 1803–1945. Subsequently, we investigate the application of four different topic models (Gensim LDA, tomotopy LDA, LeetTopic and BERTopic) on this large dataset, to demonstrate their (un)suitability for processing datasets of this scale. Among these methods, the tomotopy LDA implementation proves most reliable on large datasets. We show that topic modeling can easily isolate sections of the newspaper which are of particular interest to researchers, like trade registry entries and various kinds of labor market ads, but can also identify and isolate prominent political topics in the newspaper articles.
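The step the abstract describes as creating topic-specific sub-corpora and tracking topics over time can be illustrated with a minimal sketch. It assumes a topic model has already produced a topic-weight vector for each paragraph; the paragraph ids, years, weights, the `min_weight` threshold, and both function names are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical per-paragraph output of a topic model: (paragraph id, year, topic weights).
# All values are made up for illustration and do not come from the paper's dataset.
paragraphs = [
    ("p1", 1850, [0.80, 0.15, 0.05]),
    ("p2", 1850, [0.10, 0.85, 0.05]),
    ("p3", 1900, [0.70, 0.20, 0.10]),
    ("p4", 1900, [0.40, 0.35, 0.25]),  # no clearly dominant topic
]


def build_subcorpora(paragraphs, min_weight=0.5):
    """Group paragraph ids by dominant topic, skipping weak assignments."""
    subcorpora = defaultdict(list)
    for pid, _year, weights in paragraphs:
        # Index of the largest weight is the dominant topic.
        topic = max(range(len(weights)), key=weights.__getitem__)
        if weights[topic] >= min_weight:
            subcorpora[topic].append(pid)
    return dict(subcorpora)


def topic_share_by_year(paragraphs, topic):
    """Mean weight of one topic per year, a simple way to track it over time."""
    totals, counts = defaultdict(float), defaultdict(int)
    for _pid, year, weights in paragraphs:
        totals[year] += weights[topic]
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in sorted(totals)}


subcorpora = build_subcorpora(paragraphs)
shares = topic_share_by_year(paragraphs, topic=0)
```

With this toy input, paragraphs `p1` and `p3` fall into the sub-corpus for topic 0, `p2` into topic 1, and `p4` is left out because no topic reaches the threshold. Whether a dominant-topic threshold is the right selection rule for a real corpus is a design choice this sketch does not settle.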

Cite This Study

Kuebart et al. studied this question.

synapsesocial.com/papers/69a768afbadf0bb9e87e5981 https://doi.org/10.5617/dhnbpub.13103
