July 31, 2024 | Open Access

Fine Tuning Language Models: A Tale of Two Low-Resource Languages

Abstract

A parallel corpus is an invaluable resource for machine translation, but creating one is challenging and time-consuming. Most of the 185 languages spoken in the Philippines have abundant text, but annotated data is scarce. Bikol is one of the major languages of the Philippines, yet it has been the subject of only a few studies. This study outlines the process of developing a parallel corpus of Bikol and Filipino texts curated from biblical text and Wikipedia articles, as well as translated Bikol songs from various sources. The corpus was refined through manual phrase alignment and translation. T5 and mT5 transformer models were then fine-tuned on the parallel corpus and evaluated using the BLEU metric. BLEU scores improved notably after fine-tuning, with an increase of 49.48 for Bik-Fil and 56.07 for Fil-Bik translation. Additionally, human evaluators comprehensively assessed the fine-tuned model's results using the Multidimensional Quality Metrics and Scalar Quality Metrics error taxonomies. The fine-tuned models were made publicly accessible through Hugging Face. This study represents a significant stride in advancing machine translation tools for the Bikol and Filipino languages.
