What question did this study set out to answer?

The study aims to develop a context-aware correction framework for improving post-OCR outputs.

May 13, 2026

A Framework and Dataset for Contextual Post-OCR Correction

Key Points

The study develops a context-aware correction framework for improving post-OCR output.
The correction model is a pre-trained language model fine-tuned on small supervised corpora.
The model uses sentence-level context around the erroneous span: it takes the OCR-generated sentence plus an auxiliary context sentence and outputs a corrected sequence.
Experiments cover OCR output in three Indic languages: Hindi, Gujarati, and Marathi.
Character error rate fell from 10.20% to 6.73% for Hindi, from 6.10% to 1.39% for Gujarati, and from 8.19% to 3.29% for Marathi.
Word error rate fell from 28.53% to 14.57% for Hindi, from 20.85% to 4.16% for Gujarati, and from 24.89% to 8.28% for Marathi.
The framework outperformed seq2seq baselines; error-type analysis shows the largest improvements for diacritic placement and word-boundary errors.
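
To put the reported figures on a common scale, the following minimal sketch computes the relative error reduction for each language and metric. The before/after percentages are copied from the study; the `relative_reduction` helper and the `REPORTED` table are illustrative names, not part of the study's code.

```python
# Error rates in percent (before correction, after correction), as reported in the study.
REPORTED = {
    ("Hindi", "CER"): (10.20, 6.73),
    ("Gujarati", "CER"): (6.10, 1.39),
    ("Marathi", "CER"): (8.19, 3.29),
    ("Hindi", "WER"): (28.53, 14.57),
    ("Gujarati", "WER"): (20.85, 4.16),
    ("Marathi", "WER"): (24.89, 8.28),
}


def relative_reduction(before: float, after: float) -> float:
    """Return the relative reduction in percent: 100 * (before - after) / before."""
    if before <= 0:
        raise ValueError("the 'before' error rate must be positive")
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    for (language, metric), (before, after) in REPORTED.items():
        reduction = relative_reduction(before, after)
        print(f"{language} {metric}: {before:.2f}% -> {after:.2f}% "
              f"({reduction:.1f}% relative reduction)")
```

Relative reduction is only one way to compare the languages; the absolute differences in percentage points remain the figures the study itself reports.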

Abstract

Optical Character Recognition (OCR) has accelerated the digitization of printed Indic texts, yet recognition remains error-prone due to ligatures, matras, conjuncts, orthographic variability, and long-range grammatical dependencies. Post-OCR error correction therefore needs techniques that can exploit broader linguistic cues. This paper presents a context-aware correction framework that leverages a larger sentence-level context around the erroneous span. The correction model takes the OCR-generated sentence, along with an auxiliary context sentence, as input and outputs a corrected sequence. The correction model is a pre-trained language model fine-tuned on small supervised corpora. Experiments demonstrate substantial reductions in the character error rates (10.20% to 6.73% for Hindi, 6.10% to 1.39% for Gujarati, and 8.19% to 3.29% for Marathi) as well as the word error rates (28.53% to 14.57% for Hindi, 20.85% to 4.16% for Gujarati, and 24.89% to 8.28% for Marathi). The framework outperforms the seq2seq baselines. Error-type analysis indicates the largest improvements for diacritic placement and word-boundary errors. These results demonstrate that supplying a larger context consistently improves post-OCR correction for Indic scripts. The dataset is publicly available via Hugging Face at https://huggingface.co/datasets/AbhishekBhandari/Indic-post-ocr-correction.
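
The character and word error rates quoted above are conventionally defined as the edit distance between a hypothesis and its reference, divided by the reference length. The sketch below shows that standard definition. It assumes characters are counted as Unicode code points (so a matra counts as its own character) and words are split on whitespace; the study may tokenize or normalize differently, and the function names are illustrative.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference items to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over Unicode code points (an assumption of this sketch)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # A dropped matra: one code point out of four differs.
    print(cer("भारत", "भरत"))
    print(wer("a b c d", "a x c d"))
```

Because a single wrong character makes the whole word count as an error, the word error rate is typically much higher than the character error rate on the same text, which matches the pattern in the reported figures.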
