Abstract

As a long-term scientific project on the history of the Italian language, the Lessico Etimologico Italiano (LEI) is one of the most important and ambitious historical and etymological dictionary projects ever undertaken. Started in 1968, the LEI documents and analyzes every word of the Italian language and of all Italian dialects from their beginnings to the present day. Until 2018, the editorial process relied on the traditional lexicographical method of collecting information on annotated index cards. Each index card contains text areas, handwritten annotations, and/or textual stamps. Of particular interest are the etymon, the context (the source text), and the textual stamp, which contains the abbreviated title of the book or source from which the context was taken. In this paper, we present a novel approach for efficiently processing a large number of scanned index cards to accelerate the philological work required for producing the LEI. To this end, we propose a deep learning workflow for the automatic detection, alignment, and recognition of textual stamps on digitized index cards. To support large-scale indexing under an initially incomplete stamp inventory, we introduce an embedding-based retrieval workflow that enables previously unseen stamps to be identified and integrated during operation. Our experimental evaluation shows excellent results for stamp detection and stamp recognition, with a mean average precision of 98.80% and an accuracy of 97.02%, respectively. In addition, we compare our stamp recognition approach with two recent Large Language Model (LLM) approaches for OCR; with 97.02% accuracy, our best approach surpasses both LLMs (91.60% and 89.47%, respectively). Furthermore, we compare our approach with a recent multimodal large language model (MM-LLM) on a benchmark dataset and observe substantially lower accuracy for the MM-LLM (up to 84.58%) than for our approach (98.61%).
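To make the idea of embedding-based retrieval under an incomplete stamp inventory more concrete, the following minimal sketch shows one way such a lookup could work. It is an illustration under stated assumptions, not the implementation used in this work: the class `StampIndex`, the cosine-similarity measure, the threshold value, the toy three-dimensional vectors, and the labels `stamp_A`, `stamp_B`, and `stamp_C` are all hypothetical. In practice the vectors would come from a learned image encoder.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class StampIndex:
    """Hypothetical inventory of known stamps, each stored as one reference embedding."""
    threshold: float = 0.9  # assumed cut-off; would need tuning on real data
    labels: list = field(default_factory=list)
    vectors: list = field(default_factory=list)

    def add(self, label, vector):
        """Register a stamp in the inventory, also possible during operation."""
        self.labels.append(label)
        self.vectors.append(list(vector))

    def query(self, vector):
        """Return (label, similarity) of the nearest known stamp.

        The label is None when no stored stamp reaches the threshold,
        which flags the query as a candidate for a previously unseen stamp.
        """
        best_label, best_sim = None, -1.0
        for label, ref in zip(self.labels, self.vectors):
            sim = cosine(vector, ref)
            if sim > best_sim:
                best_label, best_sim = label, sim
        if best_sim < self.threshold:
            return None, best_sim
        return best_label, best_sim


# Toy usage with made-up 3-dimensional embeddings.
index = StampIndex(threshold=0.9)
index.add("stamp_A", [1.0, 0.0, 0.0])
index.add("stamp_B", [0.0, 1.0, 0.0])

known_label, _ = index.query([0.9, 0.1, 0.0])   # close to stamp_A
unseen_label, _ = index.query([0.0, 0.0, 1.0])  # matches nothing in the inventory

if unseen_label is None:
    # In this sketch, an editor would name the new stamp before it is added.
    index.add("stamp_C", [0.0, 0.0, 1.0])
```

The design choice illustrated here is that adding a new stamp only requires storing one more reference embedding, so the inventory can grow without retraining a fixed-class classifier.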
Korfhage et al. studied this question.