Optical Character Recognition (OCR) has accelerated the digitization of printed Indic texts, yet recognition remains error-prone due to ligatures, matras, conjuncts, orthographic variability, and long-range grammatical dependencies. Post-OCR error correction that can exploit broader linguistic cues is therefore essential. This paper presents a context-aware correction framework that leverages sentence-level context beyond the erroneous span. The correction model takes as input the OCR-generated sentence together with an auxiliary context sentence and outputs a corrected sequence. The model is a pre-trained language model fine-tuned on a small supervised corpus. Experiments show substantial reductions in character error rate (10.20% to 6.73% for Hindi, 6.10% to 1.39% for Gujarati, and 8.19% to 3.29% for Marathi) and in word error rate (28.53% to 14.57% for Hindi, 20.85% to 4.16% for Gujarati, and 24.89% to 8.28% for Marathi). The framework outperforms the seq2seq baselines. Error-type analysis indicates that the largest improvements occur for diacritic placement and word-boundary errors. Overall, the results indicate that supplying a larger context consistently improves post-OCR correction for Indic scripts. The dataset is publicly available on Hugging Face at https://huggingface.co/datasets/AbhishekBhandari/Indic-post-ocr-correction.
Bhandari et al. studied this question.
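The character and word error rates reported above are conventionally computed as the edit distance between the reference and the system output, divided by the reference length. The following is a minimal sketch of that convention, not the authors' evaluation code. The function names (`edit_distance`, `cer`, `wer`), the NFC normalization step, and the choice to count Unicode code points rather than grapheme clusters are all assumptions made for illustration; the paper may tokenize Indic text differently.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate over Unicode code points (assumption: NFC-normalized)."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(reference, hypothesis):
    """Word error rate over whitespace-separated tokens."""
    ref = unicodedata.normalize("NFC", reference).split()
    hyp = unicodedata.normalize("NFC", hypothesis).split()
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

Under this definition, a single dropped matra in a four-code-point word yields a CER of 0.25 for that word. Read as a relative change, the reported Hindi CER drop from 10.20% to 6.73% corresponds to a reduction of roughly 34%.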