March 1, 2024

Synthetic Data Generation for Document Text Recognition


Abstract

Handwritten text recognition software is used to recognize and extract text from scanned documents. The fundamental goal of this technology is to transform printed or handwritten text into an easily readable electronic format. For Indian languages, however, the large number of characters and the enormous quantity of information make explicit preprocessing time-consuming. Synthetic data creation techniques have replaced this need for explicit preprocessing; they overcome several difficulties and speed up the process. Synthetic data is artificially produced data that closely mimics actual observations. In situations where obtaining real data is difficult or expensive, it offers a workable substitute for training machine learning models. In this work, we propose a data preprocessing technique that creates synthetic data files from an existing collection of Indian-language data. We use a pre-trained language model, FastText, which is capable of creating word embeddings, to generate a synthetic dataset from the real-time dataset. With the aid of the generated synthetic datasets, document text recognition systems can undergo extensive training and testing, allowing them to grasp the complexities of Indian languages and perform accurate text extraction from scanned documents.
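The abstract does not say how the word embeddings are turned into synthetic text. As one possible illustration, the sketch below assumes a nearest-neighbor substitution scheme: each known word in a real sentence may be swapped for the word whose embedding is closest by cosine similarity. The tiny vector table, the function names (`cosine`, `nearest_neighbor`, `synthesize`), and the `swap_prob` parameter are all hypothetical; in practice the vectors would come from a pre-trained FastText model rather than a hand-written dictionary.

```python
import math
import random

# Hypothetical toy embeddings standing in for pre-trained FastText vectors.
# The numbers are invented purely for illustration.
TOY_VECTORS = {
    "नदी": [0.90, 0.10, 0.00],     # "river"
    "सरिता": [0.85, 0.15, 0.05],   # "stream"
    "पर्वत": [0.12, 0.88, 0.02],   # "mountain"
    "गिरि": [0.10, 0.90, 0.00],    # "hill"
    "सुंदर": [0.00, 0.10, 0.90],   # "beautiful"
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbor(word, vectors):
    """Return the most similar other word, or None if the word is unknown."""
    if word not in vectors:
        return None
    best, best_sim = None, -1.0
    for other, vec in vectors.items():
        if other == word:
            continue
        sim = cosine(vectors[word], vec)
        if sim > best_sim:
            best, best_sim = other, sim
    return best


def synthesize(sentence, vectors, swap_prob=0.5, seed=0):
    """Build a synthetic variant by swapping words for embedding neighbors.

    Assumed scheme, not the paper's confirmed method: each known token is
    replaced by its nearest neighbor with probability swap_prob.
    """
    rng = random.Random(seed)
    out = []
    for token in sentence.split():
        neighbor = nearest_neighbor(token, vectors)
        if neighbor is not None and rng.random() < swap_prob:
            out.append(neighbor)
        else:
            out.append(token)  # unknown words and unswapped words stay as-is
    return " ".join(out)


if __name__ == "__main__":
    print(synthesize("नदी सुंदर है", TOY_VECTORS, swap_prob=1.0))
```

The synthetic sentences produced this way would still need to be rendered as images (for example, in varied fonts or handwriting styles) before they could train a recognition system; that rendering step is outside this sketch.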
