Recent advances in machine learning and deep learning have demonstrated the utility of cross-lingual transfer learning methods in low-resource and zero-resource scenarios. We explore the applicability of transfer learning from pre-trained models to part-of-speech tagging in zero-shot and few-shot settings. We report the results of an ablation study on how the size of the training data in low-resource languages affects the system’s performance. Since building or augmenting datasets for low-resource languages is difficult, costly, and often infeasible, the study provides insights into the relative data requirements for both the high-resource language (the source language for transfer) and the low-resource language, and into the performance gain one can expect when planning to use transfer learning for low-resource languages. The study is conducted with Hindi as the high-resource language and three related languages (Magahi, Bhojpuri, and Braj) as extremely low-resource languages. Overall, the study addresses four broad research questions: (a) How much data in the low-resource and the high-resource language is “sufficient” for attaining optimum performance in a downstream task like part-of-speech annotation, and is there any specific advantage for the low-resource language if multilingual data is used during fine-tuning? (b) Do different multilingual pre-trained models, specifically multilingual-BERT, multilingual-DistilBERT, XLM-RoBERTa, and MuRIL, offer any significant advantage in terms of the dataset size required to attain optimum performance in Indian languages? (c) In the case of multiple closely related low-resource languages, does distributing the dataset across multiple languages result in performance comparable to that of a system trained on a single language? (d) What is the impact of the typological similarity of the languages on the dataset requirement for successful transfer learning?
Raj et al. (Fri,) studied this question.