Handwritten character recognition is a fundamental research area in pattern recognition and document image analysis. It has wide-ranging applications, including the digitization of handwritten documents, automation of data entry systems, archival of historical manuscripts, and development of assistive technologies. In the Indian context, the Devanagari script holds special importance because it is used by several major languages, including Hindi, Marathi, Sanskrit, Nepali, and Konkani. However, recognizing handwritten Devanagari characters remains challenging because of the complexity of character structures, the presence of modifiers, the shirorekha (headline), and the large variability in handwriting styles.

While many recent studies focus on deep-learning-based recognition systems, dataset quality and preprocessing remain fundamental to building any reliable optical character recognition (OCR) system. Poor preprocessing leads to noisy feature representations, reduced classification accuracy, and poor generalization across writers. In many practical scenarios, preprocessing contributes more to system reliability than the choice of classifier itself.

This paper focuses on the first and most essential objective of handwritten Devanagari OCR: the use of publicly available benchmark datasets and their systematic preprocessing to obtain clean, standardized handwritten character samples for training and evaluation. Two widely used datasets, the Devanagari Handwritten Character Dataset (DHCD) and the ECO-LAPS dataset, are selected for this study. A complete preprocessing pipeline is designed, consisting of normalization, grayscale conversion, binarization, noise removal, morphological processing, shirorekha removal, contour refinement, and size standardization. The study highlights the importance of dataset preparation in reducing intra-class variation and improving recognition reliability. The resulting standardized dataset forms a strong foundation for classical machine-learning-based OCR systems and supports reproducibility and fair evaluation.
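To make the pipeline stages concrete, the following is a minimal NumPy-only sketch of several of them: grayscale conversion, binarization, noise removal, shirorekha removal, and size standardization. It is an illustration under stated assumptions, not the implementation used in this study. All function names are hypothetical. Otsu thresholding stands in for the unspecified binarization method, the headline is located with a simple horizontal projection heuristic, and the 32 x 32 output size and dark-ink-on-light-background polarity are assumed defaults. Morphological processing and contour refinement are omitted.

```python
import numpy as np

def to_gray(img):
    """Convert an HxWx3 RGB array to uint8 grayscale; 2-D input passes through."""
    img = np.asarray(img, dtype=np.float64)
    if img.ndim == 3:
        img = img[..., :3] @ np.array([0.299, 0.587, 0.114])
    return np.clip(np.rint(img), 0, 255).astype(np.uint8)

def otsu_threshold(gray):
    """Return the threshold that maximizes between-class variance (Otsu)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                 # probability of the darker class
    mu = np.cumsum(prob * np.arange(256))   # cumulative mean of the darker class
    denom = omega * (1.0 - omega)
    num = (mu[-1] * omega - mu) ** 2
    sigma_b = np.where(denom > 0, num / np.maximum(denom, 1e-12), 0.0)
    return int(np.argmax(sigma_b))

def remove_isolated(ink):
    """Drop ink pixels that have no 8-connected ink neighbour (speckle removal)."""
    h, w = ink.shape
    p = np.pad(ink.astype(np.int32), 1)
    total = sum(p[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3))
    neighbours = total - p[1:-1, 1:-1]
    return ink & (neighbours > 0)

def remove_shirorekha(ink, band=1):
    """Blank the densest row in the top half, plus `band` rows on each side.

    Assumption: the headline is the row with the most ink in the upper half.
    """
    out = ink.copy()
    top = out[: max(1, out.shape[0] // 2)].sum(axis=1)
    if top.max() == 0:
        return out
    r = int(np.argmax(top))
    out[max(0, r - band): r + band + 1, :] = False
    return out

def crop_and_resize(ink, size=32):
    """Crop to the ink bounding box and resample to size x size (nearest neighbour).

    Note: this stretches the glyph and does not preserve its aspect ratio.
    """
    ys, xs = np.nonzero(ink)
    if ys.size == 0:
        return np.zeros((size, size), dtype=np.uint8)
    crop = ink[ys.min(): ys.max() + 1, xs.min(): xs.max() + 1]
    h, w = crop.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return crop[np.ix_(rows, cols)].astype(np.uint8)

def preprocess(img, size=32, dark_ink=True, strip_headline=True):
    """Run the sketched stages and return a size x size array of 0/1 values."""
    gray = to_gray(img)
    t = otsu_threshold(gray)
    ink = gray <= t if dark_ink else gray > t
    ink = remove_isolated(ink)
    if strip_headline:
        ink = remove_shirorekha(ink)
    return crop_and_resize(ink, size)
```

For samples stored as light strokes on a dark background, `dark_ink=False` inverts the polarity. Keeping the headline (`strip_headline=False`) may be preferable when the classifier relies on it as a structural cue, so the flag is left to the caller.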
Shahajahan et al. (Sun,) studied this question.