Parallel corpora are the foundation of machine translation, and translation quality depends heavily on them. Training a machine translation model therefore requires a large number of parallel sentence pairs. For many language pairs, however, large, high-quality parallel corpora remain scarce. Inspired by data augmentation techniques used in image classification, this paper presents a novel approach that improves bilingual parallel sentence extraction through targeted modifications. The method filters out bilingual sentence pairs that merely share a topic while retaining genuinely parallel pairs, that is, pairs whose sentences are translations of each other. In our experiments, the accuracy of the extracted bilingual sentences is nearly 10 percentage points higher than the baseline. To further validate the method, we train a machine translation model on the extracted parallel sentences and observe a notable improvement in translation quality over a model trained on unfiltered data.
Ke et al. studied this question.