Palm leaf manuscripts are among the oldest writing media used in South and Southeast Asia. Their use spanned from the medieval era into the nineteenth century, resulting in millions of manuscripts. These manuscripts encompass knowledge from various disciplines, but they have deteriorated over time due to aging and environmental conditions. Hence, digitizing palm leaf manuscripts is crucial for preserving and disseminating the knowledge they contain. To improve digitization performance, line-level segmentation and recognition are recommended over character-level approaches. Skewed lines, varied writing styles, and deterioration make line segmentation challenging, a difficulty compounded by the lack of diverse, publicly available datasets. To address this gap, we introduce LeafOCR-Line, a dataset for palm leaf text line segmentation consisting of 1710 manuscripts with text line masks and a corresponding deterioration level for each manuscript. Extensive experiments demonstrate the high quality and practical utility of the proposed dataset. LeafOCR-Line will be made publicly available as a resource for advancing palm leaf manuscript digitization research.
Sivan et al. studied this question.