What question did this study set out to answer?

This research investigates how much data is needed for effective transfer learning in low-resource languages.

December 12, 2025

How Much Data in Low-resource Indian Languages is "Sufficient" for Transfer Learning: A Comparative Study for POS Annotation

Key Points

Comparative analysis of transfer learning techniques for part-of-speech tagging.
Ablation study investigating the effect of dataset size in low-resource and high-resource languages.
Evaluation of performance across multiple languages including Hindi, Magahi, Bhojpuri, and Braj.
Transfer learning enhances part-of-speech tagging in low-resource languages when sufficient data is provided.
Performance depends on both the amount of data available in the target languages and the multilingual pre-trained model used.

Abstract

Recent advances in machine learning and deep learning have demonstrated the applicability and utility of cross-lingual transfer learning methods in low- and zero-resource scenarios. We explore the applicability of transfer learning from pre-trained models in zero-shot and few-shot scenarios for part-of-speech tagging. We report the results of an ablation study to understand the impact of training data size in low-resource languages on the system’s performance. Since building or augmenting datasets for low-resource languages is difficult, costly, and often infeasible, the study provides valuable insights into the expected relative data requirements for both the high-resource language (the source language for transfer) and the low-resource language, and into the kind of performance boost one can expect when planning to use transfer learning for low-resource languages. The study is conducted with Hindi as the high-resource language and three related languages (Magahi, Bhojpuri, and Braj) as extremely low-resource languages. Overall, the study addresses four broad research questions: (a) How much data in the low-resource as well as the high-resource language is “sufficient” for attaining optimum performance in a downstream task like part-of-speech annotation, and is there any specific advantage for the low-resource language if we use multilingual data during fine-tuning? (b) Do different multilingual pre-trained models, specifically multilingual-BERT, multilingual-DistilBERT, XLM-RoBERTa, and MuRIL, offer any significant advantage in terms of dataset requirements for attaining optimum performance in Indian languages? (c) In the case of multiple closely related low-resource languages, does distributing the dataset across multiple languages result in performance comparable to that of a system trained on a single language? (d) What is the impact of the typological similarity of the languages on the dataset requirement for successful transfer learning?
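
As a rough illustration of how a data-size ablation of the kind described in the abstract might be set up, the sketch below draws nested training subsets from an annotated corpus, so that each larger subset contains every smaller one and differences in tagger performance can be attributed to the added data. The function name `nested_subsets`, the subset sizes, and the toy corpus are all assumptions made for illustration; they are not taken from the study.

```python
import random


def nested_subsets(sentences, sizes, seed=0):
    """Return {size: subset}; every smaller subset is a prefix of each larger one.

    `sentences` is any list of annotated sentences, for example
    (tokens, tags) pairs. The sizes are hypothetical ablation points.
    """
    rng = random.Random(seed)      # fixed seed keeps the splits reproducible
    shuffled = list(sentences)     # copy so the caller's list is not reordered
    rng.shuffle(shuffled)

    subsets = {}
    for size in sorted(sizes):
        if size > len(shuffled):
            raise ValueError(f"requested {size} sentences, only {len(shuffled)} available")
        subsets[size] = shuffled[:size]
    return subsets


# Toy corpus standing in for a POS-annotated low-resource dataset (assumed data).
corpus = [([f"tok{i}"], ["NOUN"]) for i in range(1000)]

# Hypothetical ablation points; a real study would choose its own.
splits = nested_subsets(corpus, sizes=[100, 250, 500, 1000])

for size, subset in splits.items():
    print(f"{size} sentences requested, {len(subset)} selected")
```

Each subset would then be used to fine-tune the same pre-trained model under otherwise identical settings, so the only variable across runs is the amount of target-language data.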
