September 10, 2025 | Open Access

Modular Pipeline for Text Recognition in Early Printed Books Using Kraken and ByT5

Key Points

The modular pipeline improves OCR accuracy by reducing the Character Error Rate (CER) from around 38% to around 15%.
Using Kraken for text segmentation and ByT5 for correction, the study achieved a BLEU score improvement from 22 to 44.
The approach combines historical layout analysis with modern language modeling techniques for increased reliability.
The robust preprocessing stage involves advanced statistical filters to enhance degraded page legibility before OCR.

Abstract

Early printed books, particularly incunabula, are invaluable archives of the beginnings of modern educational systems. However, their complex layouts, antique typefaces, and page degradation caused by bleed-through and ink fading pose significant challenges for automatic transcription. In this work, we present a modular pipeline that addresses these problems by combining modern layout analysis and language modeling techniques. The pipeline begins with historical layout-aware text segmentation using Kraken, a neural network-based tool tailored for early typographic structures. Initial optical character recognition (OCR) is then performed with Kraken’s recognition engine, followed by post-correction using a fine-tuned ByT5 transformer model trained on manually aligned line-level data. By learning to map noisy OCR outputs to verified transcriptions, the model substantially improves recognition quality. The pipeline also integrates a preprocessing stage based on our previous work on bleed-through removal using robust statistical filters, including non-local means, Gaussian mixtures, biweight estimation, and Gaussian blur. This step enhances the legibility of degraded pages prior to OCR. The entire solution is open, modular, and scalable, supporting long-term preservation and improved accessibility of cultural heritage materials. Experimental results on 15th-century incunabula show a reduction in the Character Error Rate (CER) from around 38% to around 15% and an increase in the Bilingual Evaluation Understudy (BLEU) score from 22 to 44, confirming the effectiveness of our approach. This work demonstrates the potential of integrating transformer-based correction with layout-aware segmentation to enhance OCR accuracy in digital humanities applications.
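The CER figures quoted in the abstract can be read against the standard definition of the metric: the character-level edit distance between the OCR output and the verified transcription, divided by the length of the verified transcription. The sketch below follows that common definition; it is not the authors' evaluation code, and the function names and the sample Latin line are illustrative assumptions. It also shows the relative reduction implied by the approximate 38% and 15% values.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error rate that was removed."""
    return (before - after) / before


# Hypothetical line: a verified transcription and a noisy OCR reading of it.
reference = "in principio erat verbum"
ocr_output = "in princjpio erat uerbum"  # two substituted characters

print(f"CER: {cer(reference, ocr_output):.4f}")
# Using the abstract's approximate values (about 38% before, about 15% after):
print(f"Relative CER reduction: {relative_reduction(0.38, 0.15):.1%}")
```

Because both reported values are approximate, the relative reduction of roughly 60% is only an estimate.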

Cite This Study

Momtaz et al. studied this question.

synapsesocial.com/papers/68c19f9154b1d3bfb60dad80 https://doi.org/10.3390/electronics14153083
