August 18, 2025 | Open Access

Towards the Development of Balanced Synthetic Data for Correcting Grammatical Errors in Arabic: An Approach Based on Error Tagging Model and Synthetic Data Generation Model

Key Points

Training a GEC system on the synthetic data achieved a new state-of-the-art F1-score of 79.36% on the QALB-14 test set.
The study generated over 30 million synthetic sentence pairs, increasing the diversity of data available for grammatical error correction.
The error tagging model uses DeBERTav3 to label each correct sentence with the error types, out of 26, that writers are expected to make in it.
The synthetic data generation model uses back-translation to produce incorrect sentences, addressing the shortage of training data in low-resource languages like Arabic.
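
To make the multi-label tagging step concrete, here is a minimal sketch of how per-tag scores could be turned into a set of error tags. It assumes the classifier emits one logit per tag and that each tag is decided independently with a sigmoid and a fixed threshold. The tag names, the threshold value, and the function names are illustrative assumptions, not details taken from the study.

```python
import math

# Hypothetical placeholder tag names; the study uses 26 tags that are not listed here.
TAGS = ["ORTH", "MORPH", "SYNTAX", "PUNCT"]


def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def select_tags(logits, tags=TAGS, threshold=0.5):
    """Return every tag whose independent sigmoid probability meets the threshold.

    Unlike single-label classification (softmax + argmax), a sentence may
    receive zero, one, or several tags.
    """
    if len(logits) != len(tags):
        raise ValueError("expected one logit per tag")
    return [tag for tag, z in zip(tags, logits) if sigmoid(z) >= threshold]


if __name__ == "__main__":
    # Made-up logits for a single sentence.
    print(select_tags([2.0, -1.5, 0.3, -3.0]))
```

In a real system the logits would come from the classification head of the fine-tuned model, and the threshold might be tuned per tag on a development set.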

Abstract

Synthetic data generation is widely recognized as a way to improve the quality of neural grammatical error correction (GEC) systems. However, current approaches often lack diversity or are too simplistic to reproduce the wide range of grammatical errors that humans make, particularly for low-resource languages such as Arabic. In this study, we developed an error tagging model and a synthetic data generation model to build a large synthetic Arabic dataset for GEC. The error tagging model, built on DeBERTav3, takes a correct sentence and predicts the error types that humans would be expected to make in it. The Arabic Error Type Annotation (ARETA) tool guides this multi-label classification task, in which each sentence is labeled using a set of 26 error tags. The synthetic data generation model is a back-translation-based model built on AraT5: it generates an incorrect sentence from a correct one, with the error tags predicted by the error tagging model prepended to the correct sentence. On the QALB-14 and QALB-15 test sets, the error tagging model achieved 94.42% F1, a state-of-the-art result for identifying error tags in clean sentences. Training GEC on our synthetic data yielded a new state-of-the-art F1-score of 79.36% on the QALB-14 test set. In total, we generated 30,219,310 synthetic sentence pairs with the synthetic data generation model. Our data are publicly accessible.
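
The tag-conditioned input described in the abstract can be sketched as follows. This is only an illustration of the idea of placing error tags in front of a correct sentence before passing it to a sequence-to-sequence model. The angle-bracket tag format, the tag names, and the helper names are assumptions and do not reflect the study's actual input encoding.

```python
def build_tagged_source(correct_sentence: str, error_tags: list[str]) -> str:
    """Prepend error tags to a correct sentence.

    The "<TAG>" format is a hypothetical choice made for this sketch.
    """
    prefix = " ".join(f"<{tag}>" for tag in error_tags)
    return f"{prefix} {correct_sentence}".strip()


def make_gec_pair(correct_sentence: str, corrupted_sentence: str) -> dict:
    """Arrange a synthetic pair for GEC training: the erroneous text is the
    input and the original correct text is the target."""
    return {"source": corrupted_sentence, "target": correct_sentence}


if __name__ == "__main__":
    sentence = "ذهب الولد إلى المدرسة"
    # Hypothetical tags, standing in for the output of an error tagging model.
    print(build_tagged_source(sentence, ["ORTH", "PUNCT"]))
```

The tagged string would be fed to the generation model, and its erroneous output paired with the original sentence via `make_gec_pair`. The generation step itself is left out here because it depends on the trained model.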
