April 25, 2022

Named-entity recognition for a low-resource language using pre-trained language model

Abstract

This paper proposes a method for Named-Entity Recognition (NER) in a low-resource language, Tigrinya, using a pre-trained language model. Tigrinya is a morphologically rich language, yet it remains underrepresented in the field of NLP, mainly because of the limited amount of annotated data available. To address this problem, we introduce the first publicly available NER dataset for Tigrinya. The dataset contains 69,309 tokens that were manually annotated following the CoNLL 2003 Beginning, Inside, and Outside (BIO) tagging schema. In addition, we develop a new pre-trained language model for Tigrinya based on RoBERTa, which we refer to as TigRoBERTa. First, it is trained on an unlabeled Tigrinya corpus using Masked Language Modeling (MLM). Then, we show the validity of TigRoBERTa by fine-tuning it on two downstream tasks, namely NER and Part of Speech (POS) tagging. The experimental results show that the method achieved an 81.05% F1-score for NER and 92% accuracy for POS tagging, which is better than or comparable to the baseline method based on the CNN-BiLSTM-CRF model.
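
To make the BIO schema and the F1 metric mentioned above concrete, the following is a minimal Python sketch, not the authors' code. It decodes a BIO tag sequence into entity spans and computes a span-level F1 between a gold and a predicted sequence. The function names `bio_to_spans` and `span_f1`, the lenient handling of a stray `I-` tag, the exact-match scoring, and the English placeholder sentence are all assumptions made for illustration; the abstract does not specify how F1 was computed.

```python
from typing import List, Tuple

# (entity type, start index, end index exclusive); hypothetical representation
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Decode BIO tags such as B-PER, I-PER, O into entity spans."""
    spans: List[Span] = []
    start, label = None, None
    # A trailing "O" sentinel flushes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.append((label, start, i))
        start, label = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            # Assumption: a stray I- tag without a preceding B- opens a new span.
            start, label = i, tag[2:]
    return spans


def span_f1(gold: List[str], pred: List[str]) -> float:
    """Span-level F1: a predicted span counts only on an exact type and boundary match."""
    gold_spans, pred_spans = set(bio_to_spans(gold)), set(bio_to_spans(pred))
    true_positives = len(gold_spans & pred_spans)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_spans)
    recall = true_positives / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Placeholder English sentence, not taken from the Tigrinya dataset.
tokens = ["Abebe", "lives", "in", "Addis", "Ababa"]
gold = ["B-PER", "O", "O", "B-LOC", "I-LOC"]
pred = ["B-PER", "O", "O", "B-LOC", "O"]

gold_spans = bio_to_spans(gold)
score = span_f1(gold, pred)
```

In this example the prediction recovers the person span but truncates the location span, so one of two predicted spans matches one of two gold spans, and precision, recall, and F1 are all 0.5.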