June 4, 2026 | Open Access

Construction of a Bilingual Dictionary-Induced Translation Model Based on Unsupervised and Self-Supervised Learning

Key Points

Set out to construct bilingual dictionary induction and translation models using unsupervised and self-supervised learning methods.
Trained a word embedding model on monolingual corpora in a self-supervised manner.
Used an unsupervised mapping algorithm to align the embedding distributions of the two languages.
Built a lightweight neural machine translation model on top of the induced pseudo-bilingual dictionary.
Achieved a Top-1 accuracy of 91.2% with a confidence level of 0.78 in the dictionary-guided task.
Improved the BLEU score by an average of 4.5 points in low-resource translation tasks.

Abstract

With the increasing demand for cross-lingual natural language processing (CLNLP), bilingual dictionary induction and machine translation methods that depend on large-scale parallel corpora face data scarcity, difficult domain transfer, and high annotation costs. This paper proposes a method for constructing bilingual dictionary induction and translation models that combines unsupervised and self-supervised learning. First, a word embedding model is trained in a self-supervised manner on unaligned monolingual corpora to capture the internal semantic structure of each language. Second, an unsupervised mapping algorithm aligns the distributions of the embedding spaces of the two languages, with cycle consistency and pseudo-parallel data generation used to make the mapping more robust. Third, a lightweight neural machine translation (NMT) model is built on the induced pseudo-bilingual dictionary to translate automatically from the source language to the target language. Experiments were conducted on a mixed subset of open news data and Wikipedia corpora. The method achieved a Top-1 accuracy of 91.2% with a confidence level of 0.78 in the dictionary-guided task and improved the BLEU score by an average of 4.5 points in the low-resource translation task, demonstrating its effectiveness and generalization ability in resource-constrained scenarios.
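The abstract does not name the specific mapping algorithm, so the following is only a minimal sketch of one common way to realize the alignment and induction steps, not the authors' implementation. It assumes an orthogonal Procrustes mapping refined by self-learning, with a mutual-nearest-neighbor filter standing in for the cycle-consistency check. All function names, the toy data, and the small seed dictionary (a stand-in for whatever unsupervised initialization the paper uses) are hypothetical.

```python
import numpy as np

def normalize(m):
    """Scale each row to unit length so dot products are cosine similarities."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def procrustes(xs, ys):
    """Orthogonal W minimizing ||xs @ W - ys||_F (closed form via SVD)."""
    u, _, vt = np.linalg.svd(xs.T @ ys)
    return u @ vt

def induce_pairs(x, y, w):
    """Keep source->target pairs that survive the round trip back to the source."""
    sims = normalize(x @ w) @ normalize(y).T
    fwd = sims.argmax(axis=1)   # best target for each source word
    bwd = sims.argmax(axis=0)   # best source for each target word
    src = np.flatnonzero(bwd[fwd] == np.arange(len(x)))
    return src, fwd[src]

def self_learn(x, y, seed_src, seed_tgt, rounds=5):
    """Alternate between fitting the mapping and re-inducing the dictionary."""
    src, tgt = seed_src, seed_tgt
    for _ in range(rounds):
        w = procrustes(x[src], y[tgt])
        src, tgt = induce_pairs(x, y, w)
    return w

def top1_accuracy(x, y, w, gold):
    """Fraction of source words whose nearest mapped neighbor is the gold target."""
    sims = normalize(x @ w) @ normalize(y).T
    return float(np.mean(sims.argmax(axis=1) == gold))

# Toy data (assumption): the target space is a rotated, slightly noisy copy
# of the source space, and word i in one language translates to word i in the other.
rng = np.random.default_rng(0)
n, d = 200, 16
x = normalize(rng.standard_normal((n, d)))
q, _ = np.linalg.qr(rng.standard_normal((d, d)))
y = x @ q + 0.01 * rng.standard_normal((n, d))
gold = np.arange(n)

seed = np.arange(20)  # hypothetical seed pairs replacing the unsupervised initialization
w = self_learn(x, y, seed, seed)
acc = top1_accuracy(x, y, w, gold)
print(f"Top-1 accuracy on toy data: {acc:.3f}")
```

Constraining the mapping to be orthogonal preserves distances within the source space, which is one reason this family of methods is widely used for cross-lingual embedding alignment. The toy accuracy says nothing about the paper's reported 91.2%; real embeddings are only approximately isometric across languages.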
