Abstract
Synthetic data generation is widely recognized as a way to improve the quality of neural grammatical error correction (GEC) systems. However, current approaches often lack diversity or are too simplistic to reproduce the wide range of grammatical errors that humans make, particularly for low-resource languages such as Arabic. In this study, we developed an error tagging model and a synthetic data generation model to build a large synthetic dataset for Arabic GEC. The error tagging model uses DeBERTav3 to classify each correct sentence into the multiple error types that humans would be expected to make in it. The Arabic Error Type Annotation (ARETA) tool guides this multi-label classification task, in which each sentence is assigned one or more of 26 error tags. The synthetic data generation model is a back-translation-based model built on AraT5: it generates an incorrect sentence from a correct one, with the error tags produced by the error tagging model prepended to the correct sentence. On the QALB-14 and QALB-15 test sets, the error tagging model achieved 94.42% F1, a state-of-the-art result for identifying error tags in clean sentences. Training GEC on our synthetic data yielded a new state-of-the-art F1-score of 79.36% on the QALB-14 test set. We generated 30,219,310 synthetic sentence pairs with the synthetic data generation model. Our data are publicly accessible.
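The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tag names, the 0.5 decision threshold, and the `<TAG>` prefix format are assumptions made for illustration, since the abstract does not specify them. The sketch shows only how multi-label tag probabilities might be turned into a tag set and prepended to a clean sentence to form the source string for a sequence-to-sequence generator such as AraT5.

```python
from typing import Dict, List


def select_tags(probs: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Multi-label decision: keep every tag whose probability meets the threshold.

    The threshold value is an assumption; the abstract does not state one.
    """
    return sorted(tag for tag, p in probs.items() if p >= threshold)


def build_generator_input(tags: List[str], correct_sentence: str) -> str:
    """Prepend error tags to the clean sentence to form a seq2seq source string.

    The "<TAG>" prefix format is hypothetical, chosen only for illustration.
    """
    prefix = " ".join(f"<{t}>" for t in tags)
    return f"{prefix} {correct_sentence}".strip()


# Hypothetical tag names and probabilities; the paper uses 26 ARETA-guided tags
# whose labels are not listed in the abstract.
probs = {"SPELLING": 0.91, "PUNCTUATION": 0.72, "MORPHOLOGY": 0.08}
tags = select_tags(probs)
source = build_generator_input(tags, "ذهب الولد إلى المدرسة.")
```

In a full system, `probs` would come from the error tagging model and `source` would be fed to the generation model, which outputs the corresponding erroneous sentence.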
Alrehili et al. (Mon,) studied this question.