Spell-checking, which covers both the detection and the correction of misspellings, is a classic problem in natural language processing. The most common soft spelling errors in Arabic are typographical: they arise from orthographic variations of certain Arabic letters that share the same phonetic sound. This study applies a recent state-of-the-art attention-based transformer deep learning model, trained with a neural machine translation sequence-to-sequence (seq-to-seq) loss, to a Modern Standard Arabic spell-checking task. We used OpenNMT, an open-source neural network library, to train a Bidirectional Encoder Representations from Transformers (BERT) model and, as a baseline, a Bidirectional Long Short-Term Memory (BiLSTM) model for detecting and correcting soft spelling errors in Arabic. The seq-to-seq model converts corrupted text (the input sequence) into clean, error-free text (the output sequence). The synthetic dataset was generated from the 'SCUT corpus Version 3' dataset by applying a random noise-injection confusion function, which substitutes characters at random positions in the text to simulate spelling errors. The intention was to mimic common human typing or transcription errors, including typographical errors and cognitive misspellings. Model performance was assessed with respect to both the corruption ratio injected into the data and the length of the input sequence. The trained models' results, measured by accuracy and Bilingual Evaluation Understudy (BLEU) score, were promising and competitive with other solutions.
AbdulNabi et al. studied this question.
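The random noise-injection step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the confusion groups (alif and hamza variants, ta marbuta versus ha, alif maqsura versus ya), the function name `inject_noise`, and the choice to measure the corruption ratio against the confusable characters only are all assumptions made for illustration.

```python
import random

# Hypothetical confusion groups for soft Arabic spelling errors.
# These groups are an assumption for illustration, not taken from the study:
# alif with and without hamza or madda, ta marbuta vs. ha, alif maqsura vs. ya.
CONFUSION_GROUPS = [
    "اأإآ",
    "ةه",
    "ىي",
]

# Map each letter to the alternatives it may be confused with.
CONFUSABLE = {
    ch: [alt for alt in group if alt != ch]
    for group in CONFUSION_GROUPS
    for ch in group
}


def inject_noise(text, corruption_ratio, rng):
    """Return a copy of `text` with a fraction of its confusable letters swapped.

    `corruption_ratio` is interpreted here as the share of confusable
    characters to corrupt (an assumption; the study may define it differently).
    """
    if not 0.0 <= corruption_ratio <= 1.0:
        raise ValueError("corruption_ratio must be between 0 and 1")
    chars = list(text)
    # Positions holding a letter that has at least one confusable alternative.
    candidates = [i for i, ch in enumerate(chars) if ch in CONFUSABLE]
    n_errors = round(len(candidates) * corruption_ratio)
    # Pick distinct random positions and substitute a different letter
    # from the same confusion group at each one.
    for i in rng.sample(candidates, n_errors):
        chars[i] = rng.choice(CONFUSABLE[chars[i]])
    return "".join(chars)


# Build one (corrupted source, clean target) training pair.
rng = random.Random(0)
clean = "ذهب أحمد إلى المدرسة"
noisy = inject_noise(clean, 0.5, rng)
training_pair = (noisy, clean)
```

Each pair places the corrupted sentence on the source side and the clean sentence on the target side, which matches the seq-to-seq framing in which the model translates noisy text into error-free text. Passing a seeded `random.Random` instance keeps the synthetic corpus reproducible across runs.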