Digitizing Tamil handwritten manuscripts is essential for preserving ancient historical knowledge. Palm-leaf manuscripts carry rich cultural information, but they suffer from degradation, noise, and structural defects such as punch holes, all of which hinder automatic text recognition. Inconsistent backgrounds, ink spread, and non-text elements cause traditional OCR to fail. This paper proposes a preprocessing and recognition framework that combines image enhancement with a CNN-Vision Transformer model for detecting Tamil characters. Each image is denoised with Non-Local Means and median filters and then binarized with Sauvola adaptive thresholding. Punch holes are removed using binary thresholding and morphological dilation. Character regions are extracted using connected component analysis and recognized by the CNN-Vision Transformer model. High-confidence regions are marked with bounding boxes on the manuscript image. The proposed framework shows improved character segmentation and more reliable recognition results.
Pavithra et al. (Mon,) studied this question.
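
The Sauvola thresholding and connected component steps named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the window size, the parameters k = 0.2 and R = 128, the 8-connectivity, and the helper names (sauvola_binarize, connected_components, demo_page) are all assumptions chosen for illustration. Denoising, punch-hole removal, and the CNN-Vision Transformer classifier are omitted.

```python
from collections import deque
from statistics import fmean, pstdev


def sauvola_binarize(gray, window=3, k=0.2, R=128.0):
    """Return a binary mask (1 = ink) using Sauvola's local threshold.

    T = m * (1 + k * (s / R - 1)), where m and s are the mean and standard
    deviation of the window centred on each pixel. The values of window,
    k and R are illustrative assumptions, not taken from the paper.
    """
    h, w = len(gray), len(gray[0])
    r = window // 2
    mask = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Collect the local window, clipped at the image border.
            patch = [gray[j][i]
                     for j in range(max(0, y - r), min(h, y + r + 1))
                     for i in range(max(0, x - r), min(w, x + r + 1))]
            m, s = fmean(patch), pstdev(patch)
            threshold = m * (1 + k * (s / R - 1))
            # Pixels darker than the local threshold count as ink.
            mask[y][x] = 1 if gray[y][x] < threshold else 0
    return mask


def connected_components(mask):
    """Return bounding boxes (top, left, bottom, right), inclusive, of the
    8-connected foreground regions, in raster-scan order."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            top, left, bottom, right = y, x, y, x
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:  # breadth-first flood fill of one region
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny in range(max(0, cy - 1), min(h, cy + 2)):
                    for nx in range(max(0, cx - 1), min(w, cx + 2)):
                        if mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def demo_page():
    """Hypothetical 5x8 grayscale page: light background, two dark strokes."""
    bg, ink = 200, 30
    page = [[bg] * 8 for _ in range(5)]
    for y, x in [(1, 1), (1, 2), (2, 1), (2, 2), (1, 5), (2, 5), (3, 5)]:
        page[y][x] = ink
    return page


if __name__ == "__main__":
    print(connected_components(sauvola_binarize(demo_page())))
```

On a real scan the window would be much larger than 3 pixels, and each bounding box would be cropped and passed to the recognition model. A practical pipeline would more likely use library routines such as those in OpenCV or scikit-image than these nested loops.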