Datasets Collection Framework for Low-Resourced Languages in South Africa

Abstract

The linguistic diversity of South Africa presents a unique challenge for Natural Language Processing (NLP) applications, as many of the country's languages are considered low-resourced. Of the eleven official languages, eight currently lack sufficient documentation and resources; the remaining three, English, Afrikaans, and isiZulu, account for the majority of the reported datasets. This paper introduces a data collection framework tailored to address the scarcity of linguistic resources for the underrepresented languages of South Africa. The framework consists of a Language Identification (LI) model embedded in a database portal that gathers text data, labels it, and stores it in the database for future use and for retraining the LI model. In addition, different machine learning classifiers were compared for their effectiveness in LI tasks, and the best-performing classifier was used for the proof-of-concept implementation. It is anticipated that collecting such resources will foster greater inclusivity, enabling the development of language technologies that cater to a broader linguistic landscape and promote cultural preservation in the digital era. This work contributes to broader efforts to preserve linguistic diversity and promote inclusive technological solutions in multilingual societies.
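The collect, label, store, and retrain loop described in the abstract could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the character-trigram naive Bayes classifier is a stand-in (the abstract does not name the best-performing classifier), the SQLite table `samples`, the helper names `bootstrap`, `submit`, and `retrain`, and the handful of seed phrases are all hypothetical and are not drawn from the paper's data.

```python
import math
import sqlite3
from collections import Counter, defaultdict

# Toy seed phrases (illustrative only, not from the paper's datasets).
SEED = [
    ("Sawubona, unjani namhlanje?", "zu"),
    ("Ngiyabonga kakhulu", "zu"),
    ("Goeie môre, hoe gaan dit met jou?", "af"),
    ("Baie dankie vir jou hulp", "af"),
    ("Good morning, how are you today?", "en"),
    ("Thank you very much for your help", "en"),
    ("Dumela, o phela jwang?", "st"),
    ("Kea leboha haholo", "st"),
]


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character trigrams with add-one smoothing."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.docs = Counter(labels)         # label -> number of training texts
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.counts.values() for g in c}
        return self

    def predict(self, text):
        total_docs = sum(self.docs.values())
        best_label, best_score = None, -math.inf
        for label, grams in self.counts.items():
            # The extra +1 reserves probability mass for unseen n-grams.
            denom = sum(grams.values()) + len(self.vocab) + 1
            score = math.log(self.docs[label] / total_docs)
            for g in char_ngrams(text):
                score += math.log((grams[g] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def retrain(conn):
    """Fit a fresh LI model on every labelled sample currently stored."""
    rows = conn.execute("SELECT text, label FROM samples").fetchall()
    texts, labels = zip(*rows)
    return NgramNaiveBayes().fit(texts, labels)


def bootstrap(conn):
    """Create the (hypothetical) samples table, load the seeds, train a model."""
    conn.execute("CREATE TABLE IF NOT EXISTS samples (text TEXT, label TEXT)")
    conn.executemany("INSERT INTO samples (text, label) VALUES (?, ?)", SEED)
    conn.commit()
    return retrain(conn)


def submit(conn, model, text):
    """Label a submitted text with the LI model and store it for later retraining."""
    label = model.predict(text)
    conn.execute("INSERT INTO samples (text, label) VALUES (?, ?)", (text, label))
    conn.commit()
    return label
```

In use, a portal would call `bootstrap` once, pass each incoming text through `submit`, and periodically call `retrain` so that newly stored samples feed back into the LI model. A real deployment would presumably add human verification of predicted labels before retraining; that step is omitted here.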
