What question did this study set out to answer?

This research aims to improve the performance of multilingual large language models on cross-lingual tasks by addressing data imbalances and monolingual bias in pre-training.

April 26, 2026 · Open Access

Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance

Key points

Introduced a Cross-Lingual Mapping Task during pre-training to enhance cross-lingual alignment.
Bi-directionally mapped languages in the LLM's embedding space to improve generation and comprehension.
Developed a Language Alignment Coefficient to quantify cross-lingual consistency in limited-data scenarios.
Achieved up to 11.9 BLEU score gains in machine translation.
Increased CLQA BERTScore-Precision by 6.72 points.
Improved CLNLU accuracy by more than 5% compared to strong multilingual baselines.

Abstract

Multilingual Large Language Models (LLMs) struggle with cross-lingual tasks because of data imbalances between high-resource and low-resource languages and the monolingual bias in pre-training. Existing methods, such as bilingual fine-tuning and contrastive alignment, improve cross-lingual performance but often require extensive parallel data or suffer from instability. To address these challenges, we introduce a Cross-Lingual Mapping Task in the pre-training phase, which enhances cross-lingual alignment without compromising monolingual fluency. Our approach bi-directionally maps languages within the LLM’s embedding space, improving both language generation and comprehension. We further introduce a Language Alignment Coefficient to robustly quantify cross-lingual consistency, even in limited-data scenarios. Experimental results on machine translation (MT), cross-lingual natural language understanding (CLNLU), and cross-lingual question answering (CLQA) show that, over strong multilingual baselines, our model achieves gains of up to 11.9 BLEU in MT, a 6.72-point increase in CLQA BERTScore-Precision, and an increase of more than 5% in CLNLU accuracy. Our findings highlight the potential of embedding cross-lingual objectives into pre-training to improve multilingual LLMs.
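
The abstract does not define the Language Alignment Coefficient, so the following is not the paper's metric. As a rough illustration of what quantifying cross-lingual consistency in an embedding space can look like, this minimal sketch computes a generic proxy: the mean cosine similarity between embeddings of parallel sentences. The function names and the toy vectors are hypothetical and stand in for real model embeddings.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_pairwise_alignment(
    src: Sequence[Sequence[float]], tgt: Sequence[Sequence[float]]
) -> float:
    """Average cosine similarity over parallel sentence pairs.

    Hypothetical proxy for cross-lingual alignment; NOT the paper's
    Language Alignment Coefficient, which the abstract does not specify.
    src[i] and tgt[i] are assumed to embed translations of each other.
    """
    if len(src) != len(tgt) or len(src) == 0:
        raise ValueError("need a non-empty list of parallel pairs")
    return sum(cosine(u, v) for u, v in zip(src, tgt)) / len(src)


# Toy embeddings (assumed values, for illustration only).
source_embeddings = [[1.0, 0.0], [0.0, 1.0]]
target_embeddings = [[1.0, 0.0], [1.0, 1.0]]
score = mean_pairwise_alignment(source_embeddings, target_embeddings)
print(f"mean pairwise alignment: {score:.4f}")
```

A score near 1 would indicate that translations occupy nearly the same direction in the embedding space. With few parallel pairs such an average is noisy, which is the limited-data setting in which the abstract says the proposed coefficient is meant to be robust.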
