Multilingual large language models (LLMs) struggle with cross-lingual tasks because of data imbalances between high-resource and low-resource languages and a monolingual bias in pre-training. Existing methods, such as bilingual fine-tuning and contrastive alignment, improve cross-lingual performance but often require extensive parallel data or suffer from instability. To address these challenges, we introduce a Cross-Lingual Mapping Task in the pre-training phase, which enhances cross-lingual alignment without compromising monolingual fluency. Our approach maps languages bidirectionally within the LLM’s embedding space, improving both language generation and comprehension. We further introduce a Language Alignment Coefficient that robustly quantifies cross-lingual consistency, even in limited-data scenarios. Experimental results on machine translation (MT), cross-lingual natural language understanding (CLNLU), and cross-lingual question answering (CLQA) show that our model achieves gains of up to 11.9 BLEU in MT, an increase of 6.72 in CLQA BERTScore-Precision, and an increase of more than 5% in CLNLU accuracy over strong multilingual baselines. Our findings highlight the potential of embedding cross-lingual objectives into pre-training to improve multilingual LLMs.
Weihua Zheng
Chang Liu
Zhengyuan Liu
ACM Transactions on Asian and Low-Resource Language Information Processing
Japan Science and Technology Agency
Agency for Science, Technology and Research
Singapore University of Technology and Design
Zheng et al. (Fri,) studied this question.
www.synapsesocial.com/papers/69edac794a46254e215b434f — DOI: https://doi.org/10.1145/3811819