October 6, 2021

Using Pre-Processing Methods to Improve OCR Performances of Digital Historical Documents

Abstract

Digitizing historical documents and processing them with OCR, which opens them up to online access and translation, is a major priority for compilation libraries in Turkey and around the world. Problems arise when old, hard-to-read Turkish documents written in Latin letters are passed through OCR. The aim of this study is to examine how three different thresholding algorithms, applied after upsampling as a preprocessing step, affect OCR performance on historical documents. The study shows that OCR success can be increased by applying upsampling and image thresholding to digital documents before OCR. The words obtained from the processed documents were tested with NLP libraries, and the success of the proposed method was measured by determining whether those words exist in the Turkish language. The proposed method was applied to 30 different newspaper front pages published between 1930 and 1970. Words in the documents to which the proposed method was applied were detected with nearly 18% better accuracy.
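
The pipeline described in the abstract (upsample, threshold, then score the recognized words against the language) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the upsampling method or the three thresholding algorithms, so nearest-neighbor upsampling and Otsu's method are assumptions chosen for illustration. The OCR step itself is omitted, and the small `lexicon` set is a hypothetical stand-in for the NLP libraries used in the study.

```python
from typing import Iterable, List, Set

Image = List[List[int]]  # rows of 8-bit grayscale values (0-255)


def upsample_nearest(img: Image, factor: int = 2) -> Image:
    """Enlarge the image by repeating each pixel `factor` times on both axes.
    Nearest-neighbor is an assumption; the abstract does not name a method."""
    out: Image = []
    for row in img:
        wide = [p for p in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def otsu_threshold(img: Image) -> int:
    """Return the gray level that maximizes between-class variance (Otsu).
    Otsu is an assumed example of a global thresholding algorithm."""
    hist = [0] * 256
    for row in img:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue  # no pixels at or below t yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is already in the dark class
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(img: Image, t: int) -> Image:
    """Map pixels above t to white (255) and the rest to black (0)."""
    return [[255 if p > t else 0 for p in row] for row in img]


def valid_word_ratio(words: Iterable[str], lexicon: Set[str]) -> float:
    """Fraction of OCR tokens found in the lexicon (exact match only;
    Turkish-aware case folding is deliberately left out of this sketch)."""
    tokens = list(words)
    if not tokens:
        return 0.0
    return sum(1 for w in tokens if w in lexicon) / len(tokens)


# Toy 2x2 "scan": dark ink in the left column, light paper in the right.
page = [[10, 200], [20, 220]]
big = upsample_nearest(page, factor=2)
clean = binarize(big, otsu_threshold(big))

# Hypothetical OCR output and lexicon, for illustration only.
lexicon = {"gazete", "haber", "ankara"}
score = valid_word_ratio(["gazete", "gaz3te", "haber", "ankara"], lexicon)
```

Comparing `valid_word_ratio` on OCR output with and without the preprocessing steps gives a simple before/after measure in the spirit of the evaluation described above.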
