As demand for cross-lingual natural language processing (CLNLP) grows, bilingual dictionary induction and machine translation methods that rely on large-scale parallel corpora face data scarcity, difficult domain transfer, and high annotation costs. This paper proposes a method for building bilingual dictionary induction and translation models that combines unsupervised and self-supervised learning. First, a word embedding model is trained in a self-supervised manner on unaligned monolingual corpora to capture the internal semantic structure of each language. Second, an unsupervised mapping algorithm aligns the distributions of the embedding spaces of the different languages, with cycle consistency and pseudo-parallel data generation used to make the mapping more robust. Third, a lightweight neural machine translation (NMT) model is built on the induced pseudo-bilingual dictionary to translate automatically from the source language to the target language. Experiments were conducted on a mixed subset of open news data and Wikipedia corpora. The method achieved a Top-1 accuracy of 91.2% with a confidence level of 0.78 on the dictionary-guided task and improved the BLEU score by an average of 4.5 points on the low-resource translation task, demonstrating its effectiveness and generalization ability in resource-constrained scenarios.
Xinyue Kang (Thu,) studied this question.
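The embedding-space alignment step summarized in the abstract can be illustrated with a minimal sketch. The abstract does not specify the mapping algorithm, so this example substitutes a common stand-in: orthogonal Procrustes refinement with a mutual-nearest-neighbour filter playing the role of a cycle-consistency check. The function names, the toy data, and the rough initial pairing are all assumptions made for illustration; a fully unsupervised pipeline would have to obtain that initial pairing without labels.

```python
import numpy as np

def normalize(E):
    """Scale each embedding row to unit length so dot products are cosines."""
    return E / np.linalg.norm(E, axis=1, keepdims=True)

def procrustes(X, Y):
    """Orthogonal W minimizing ||X W - Y||_F (closed form via SVD)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def mutual_nearest_neighbours(S):
    """Keep pairs (i, j) where j is i's best target AND i is j's best source.

    This round-trip agreement is a simple proxy for cycle consistency.
    """
    fwd = S.argmax(axis=1)            # best target for each source word
    bwd = S.argmax(axis=0)            # best source for each target word
    src = np.arange(S.shape[0])
    keep = bwd[fwd] == src
    return src[keep], fwd[keep]

def induce_dictionary(X, Y, src_idx, tgt_idx, n_iter=5):
    """Alternate between fitting W on the current pairs and re-inducing pairs."""
    X, Y = normalize(X), normalize(Y)
    for _ in range(n_iter):
        W = procrustes(X[src_idx], Y[tgt_idx])
        S = (X @ W) @ Y.T             # cosine similarity after mapping
        src_idx, tgt_idx = mutual_nearest_neighbours(S)
    return W, src_idx, tgt_idx

# Toy data (hypothetical): the "target language" is a rotated, slightly noisy
# copy of the "source language", so word i should translate to word i.
rng = np.random.default_rng(0)
n_words, dim = 200, 16
X = rng.normal(size=(n_words, dim))
Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))   # hidden orthogonal map
Y = X @ Q + 0.01 * rng.normal(size=(n_words, dim))

# Assumed rough initial pairing covering only the first 25 words.
init = np.arange(25)
W, src, tgt = induce_dictionary(X, Y, init, init)
print(f"induced {len(src)} pairs, {np.mean(src == tgt):.0%} correct")
```

On real embeddings the similarity matrix would typically be computed over a restricted vocabulary, and the induced pairs would then serve as the pseudo-bilingual dictionary that feeds the translation model.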