What question did this study set out to answer?

This research aims to enhance machine translation quality by improving bilingual parallel sentence extraction methods.

May 30, 2026

Exploring to mine parallel sentences for improving machine translation

Key Points

Utilized a novel approach based on data augmentation techniques from image classification.
Filtered out bilingual sentence pairs that are merely similar in topic while retaining genuinely parallel pairs.
Conducted experiments to compare the efficacy of the new method against baseline parallel extraction.
Achieved an accuracy gain of nearly 10 percentage points in bilingual sentence extraction over the baseline.
Validated the method by training a machine translation model with the extracted sentences, resulting in improved translation quality.
Demonstrated that a model trained on the filtered data translates better than one trained on unfiltered data.

Abstract

A parallel corpus is the foundation of machine translation, and translation quality depends on it. Training a machine translation model therefore requires a large number of parallel sentence pairs. However, many language pairs lack large, high-quality parallel corpora. Inspired by the data augmentation techniques used in image classification, this paper presents a modified approach that improves bilingual parallel sentence extraction. The method filters out bilingual sentence pairs that are merely similar in topic, while preserving genuinely parallel pairs, that is, pairs whose sentences can be translated into each other. In our experiments, the accuracy of bilingual sentence extraction is nearly 10 percentage points higher than the baseline. To further validate the method, we train a machine translation model on the extracted parallel sentences, which yields a notable improvement in translation quality over a model trained on unfiltered data.
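
The distinction between topically similar pairs and genuinely parallel pairs can be illustrated with a small sketch. This is not the paper's method, which the abstract does not spell out; it is a simple lexical-coverage heuristic chosen only for illustration. The names `TOY_LEXICON`, `coverage`, `invert`, and `is_parallel`, the 0.8 threshold, and the example sentences are all assumptions made up for this example.

```python
# Illustrative sketch only: a lexical-coverage filter, NOT the paper's method.
# The lexicon, function names, threshold, and sentences are hypothetical.

TOY_LEXICON = {
    "the": {"le", "la"},
    "cat": {"chat"},
    "dog": {"chien"},
    "sleeps": {"dort"},
    "eats": {"mange"},
    "on": {"sur"},
    "mat": {"tapis"},
}


def coverage(src_tokens, tgt_tokens, lexicon):
    """Fraction of src tokens with at least one lexicon translation in tgt."""
    if not src_tokens:
        return 0.0
    tgt = set(tgt_tokens)
    hits = sum(1 for tok in src_tokens if lexicon.get(tok, set()) & tgt)
    return hits / len(src_tokens)


def invert(lexicon):
    """Build the target-to-source lexicon from the source-to-target one."""
    inverted = {}
    for src, targets in lexicon.items():
        for tgt in targets:
            inverted.setdefault(tgt, set()).add(src)
    return inverted


def is_parallel(src, tgt, lexicon, threshold=0.8):
    """Keep a pair only if lexical coverage is high in BOTH directions."""
    src_tokens = src.lower().split()
    tgt_tokens = tgt.lower().split()
    forward = coverage(src_tokens, tgt_tokens, lexicon)
    backward = coverage(tgt_tokens, src_tokens, invert(lexicon))
    return min(forward, backward) >= threshold


candidates = [
    # A genuine translation pair.
    ("the cat sleeps on the mat", "le chat dort sur le tapis"),
    # Same topic (a pet on a mat), but not a translation.
    ("the dog eats on the mat", "le chat dort sur le tapis"),
    # The source is only a fragment of the target.
    ("the cat sleeps", "le chat dort sur le tapis"),
]

kept = [pair for pair in candidates if is_parallel(*pair, TOY_LEXICON)]
for src, tgt in kept:
    print(f"{src} ||| {tgt}")
```

Requiring high coverage in both directions is the design choice that matters here: the second and third candidates share many words with the target, yet each leaves part of one sentence untranslated, so only the first pair is kept. A realistic extractor would more likely score pairs with learned sentence representations than with a hand-written lexicon.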
