What type of study is this?

This is a quantitative study.

September 27, 2025

Open Access

Fine-tuning a model based on the Transformer architecture for normalizing a corpus of medieval texts in German from the 14th-15th centuries from the Order of Prussia.

Key Points

The fine-tuned Transformer model normalizes medieval German texts effectively, reaching an Accuracy OOV (accuracy on out-of-vocabulary words) of 89.6.
Training used a custom dataset of 6,570 original-normalized word pairs, prepared with the specific goals of the normalization in mind.
The study shows the limitations of existing neural language models when applied to historical texts and calls for careful consideration of normalization objectives.
In a comparative analysis, the fine-tuned model outperformed the other models, which points to the potential of NLP applications in historical research.
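
The word-level metrics named above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code. It assumes that "Accuracy OOV" means exact-match accuracy restricted to tokens whose original form did not occur in the training data, which is a common reading but is not spelled out on this page. The function names and the toy word pairs are invented for the example.

```python
def accuracy(predictions, references):
    """Share of tokens whose predicted form equals the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def accuracy_oov(sources, predictions, references, train_vocab):
    """Accuracy over tokens whose source form is absent from train_vocab.

    Assumption: OOV is decided by the original (non-normalized) form.
    """
    oov_pairs = [
        (p, r)
        for s, p, r in zip(sources, predictions, references)
        if s not in train_vocab
    ]
    if not oov_pairs:
        return 0.0
    return sum(p == r for p, r in oov_pairs) / len(oov_pairs)


# Toy data, invented for illustration only (not taken from the paper's dataset).
train_vocab = {"vnd", "czu"}                      # source forms seen in training
sources = ["vnd", "czu", "stat", "bruder"]        # original spellings
predictions = ["und", "zu", "stadt", "brueder"]   # hypothetical model output
references = ["und", "zu", "stadt", "bruder"]     # hypothetical gold forms

overall = accuracy(predictions, references)
oov_only = accuracy_oov(sources, predictions, references, train_vocab)
print(overall, oov_only)
```

Reporting the OOV figure separately matters because a model can score well overall simply by memorizing frequent training pairs; the OOV subset measures how well it generalizes to unseen spellings.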

Abstract

The article addresses methods for the automatic normalization of texts in Middle High German and Early New High German, with the aim of applying NLP in medieval history research. It reviews existing approaches to the automatic normalization of historical German texts and identifies the problems specific to medieval German texts, in particular the peculiarities of using substitution dictionaries and replacement rules. It describes the limitations of these approaches and the need to take the goals of normalization into account. Neural language models are identified as the most promising approach to automatic normalization. The study compares the effectiveness of existing neural language models (NMT) on texts in Middle High German and Early New High German and demonstrates that NMT trained on texts from the New and Modern eras perform poorly on them. Based on reviews in the literature, it argues that NMT must be prepared for specific goals and corpora. To normalize texts from the 14th-15th centuries created in monastic Prussia, a neural language model based on the Transformer architecture (BART) was further trained, and its effectiveness is presented in comparison with other models. The model was trained on a custom dataset of 6,570 original-normalized word pairs, for 28 epochs with a batch size of 50. The DTAEC Type Normalizer model was chosen for normalizing a corpus of texts in three historical forms of the German language. The normalization quality of the retrained model was compared with that of existing models trained on German texts from the New and Modern eras, using the metrics Accuracy, Accuracy OOV, CER, and Levenshtein distance. The retrained model proves considerably more effective than the other models. One sentence normalized by the model is presented for review and compared with a benchmark. Instances of "hallucinations" in the retrained model were identified. With an Accuracy OOV of 89.6, the method is considered promising. However, the identified shortcomings in normalization indicate that additional normalization methods, such as lemmatization, are needed.
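
The two character-level metrics mentioned in the abstract, Levenshtein distance and CER, can be illustrated with a short sketch. It is not the authors' implementation. It assumes the usual definitions: Levenshtein distance as the minimum number of single-character insertions, deletions, and substitutions, and CER as the total edit distance divided by the total number of reference characters. The function names and sample strings are chosen for the example only.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits that turn string a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep it
            ))
        prev = curr
    return prev[-1]


def cer(predictions, references):
    """Character error rate: summed edit distance over summed reference length."""
    total_edits = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars if total_chars else 0.0


# Toy strings, invented for illustration only.
print(levenshtein("stat", "stadt"))
print(cer(["stat"], ["stadt"]))
```

Unlike exact-match accuracy, these metrics give partial credit: a prediction that differs from the reference by one character counts as a small error rather than a complete miss, which helps distinguish near-correct normalizations from hallucinated forms.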
