February 12, 2026 | Open Access

LinguoNER: A Language-Agnostic Framework for Named Entity Recognition in Low-Resource Languages with a Focus on Yambeta

Key Points

The research aims to develop and validate a framework for Named Entity Recognition in low-resource languages, focusing on Yambeta.
The authors implement an end-to-end NER workflow covering corpus acquisition, gazetteer-driven automatic annotation, tokenizer training, transformer fine-tuning, and multi-level evaluation.
Starting from a Bible-derived corpus, they release the first publicly available Yambeta NER dataset (≈25,000 tokens), annotated with the CoNLL BIO scheme and a PER/LOC/ORG schema.
They train a dedicated Yambeta WordPiece tokenizer that preserves diacritics and tone markers.
They fine-tune a bert-base-cased transformer for token classification and evaluate it on a held-out test split.
The model reaches token-level Precision = 0.989, Recall = 0.981, and F1 = 0.985.
It outperforms a dictionary-only gazetteer baseline by ΔF1 ≈ 0.36.
Per-entity-type evaluation indicates that the gains go beyond surface-form matching.
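
To make the reported token-level figures concrete, the following is a minimal sketch of one common way to compute token-level precision, recall, and F1 over entity tokens, and the F1 gap between a model and a baseline. The summary does not specify the exact averaging used in the paper, so the micro-averaged definition here, the function name `token_prf`, and the toy tag sequences are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def token_prf(
    gold: Sequence[str], pred: Sequence[str], outside: str = "O"
) -> Tuple[float, float, float]:
    """Micro-averaged token-level precision, recall, and F1 over entity tokens.

    Assumed definition (not confirmed by the paper): a token counts as a
    true positive when it is an entity token in gold and the predicted tag
    matches it exactly.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g != outside and g == p)
    n_pred = sum(1 for p in pred if p != outside)  # tokens predicted as entities
    n_gold = sum(1 for g in gold if g != outside)  # entity tokens in gold
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy tags invented for illustration only (not drawn from the Yambeta dataset).
gold = ["B-PER", "O", "B-LOC", "I-LOC", "O", "B-ORG"]
model_pred = ["B-PER", "O", "B-LOC", "I-LOC", "O", "O"]
baseline_pred = ["B-PER", "O", "B-LOC", "O", "O", "O"]

model_p, model_r, model_f1 = token_prf(gold, model_pred)
base_p, base_r, base_f1 = token_prf(gold, baseline_pred)
delta_f1 = model_f1 - base_f1  # analogous in spirit to the reported ΔF1
```

On these toy sequences the baseline misses the second token of a two-token location, which lowers its recall while leaving precision untouched; the same kind of gap, at a much larger scale, is what a ΔF1 comparison against a dictionary-only baseline summarizes.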

Abstract

This paper presents LinguoNER, a practical and extensible framework for bootstrapping Named Entity Recognition (NER) in extremely low-resource languages, demonstrated on Yambeta, a Bantu language spoken by a minority community in Cameroon. Due to scarce digital resources and the absence of annotated corpora, Yambeta has remained largely underrepresented in Natural Language Processing (NLP). LinguoNER addresses this gap by providing a methodologically transparent end-to-end workflow that integrates corpus acquisition, gazetteer-driven automatic annotation, tokenizer training, transformer fine-tuning, and multi-level evaluation in settings where large-scale manual annotation is infeasible. Using a Bible-derived corpus as a linguistically stable starting point, we release the first publicly available Yambeta NER dataset (≈25,000 tokens) annotated with the CoNLL BIO scheme and a restricted entity schema (PER/LOC/ORG). Because labels are generated via dictionary-based annotation, the corpus is best characterized as silver-standard; credibility is strengthened through recorded dictionaries, transparency logs, expert-in-the-loop validation on sampled subsets, and complementary qualitative error analysis. We additionally train a dedicated Yambeta WordPiece tokenizer that preserves tone markers and diacritics, and fine-tune a bert-base-cased transformer for token classification. On a held-out test split, LinguoNER achieves strong token-level performance (Precision = 0.989, Recall = 0.981, F1 = 0.985), substantially outperforming a dictionary-only gazetteer baseline (ΔF1 ≈ 0.36). Per-entity-type evaluation further indicates improvements beyond surface-form matching, while remaining errors are linguistically motivated and primarily involve multi-word entity boundaries, agglutinative constructions, and tone-/diacritic-sensitive tokenization. 
We emphasize that results are restricted to a Bible domain and a limited label space, and should be interpreted as proof-of-concept evidence rather than claims of broad out-of-domain generalization. Overall, LinguoNER provides a reproducible blueprint for bootstrapping NER resources in underrepresented languages and supports future work on broader corpus sources (e.g., news, OPUS, JW300), additional African languages (e.g., Yoruba, Igbo, Bassa), and the iterative creation of expert-refined datasets and gold-standard subsets.
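
The gazetteer-driven annotation step described in the abstract can be sketched as a dictionary lookup that emits CoNLL BIO tags. This is a minimal illustration under stated assumptions, not the authors' implementation: the greedy longest-match strategy, the case-sensitive exact matching, the function name `bio_annotate`, and the English placeholder entries in `GAZETTEER` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical gazetteer: token tuples mapped to an entity type.
# These are English placeholders, not entries from the Yambeta dictionaries.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("Moses",): "PER",
    ("Mount", "Sinai"): "LOC",
    ("Sinai",): "LOC",
}


def bio_annotate(
    tokens: List[str], gazetteer: Dict[Tuple[str, ...], str]
) -> List[str]:
    """Tag tokens with BIO labels using greedy longest-match lookup."""
    max_len = max((len(key) for key in gazetteer), default=0)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first so multi-word entries win.
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            label = gazetteer.get(tuple(tokens[i:i + span]))
            if label is not None:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + span):
                    tags[j] = f"I-{label}"
                matched = span
                break
        # Skip past a matched span, or advance one token if nothing matched.
        i += matched if matched else 1
    return tags


tokens = ["Moses", "went", "up", "Mount", "Sinai", "."]
tags = bio_annotate(tokens, GAZETTEER)
# CoNLL-style lines: one token and its tag per line.
conll_lines = [f"{tok} {tag}" for tok, tag in zip(tokens, tags)]
```

Because every label comes from surface-form lookup, an annotator of this kind cannot resolve ambiguous or unseen forms, which is consistent with the abstract's description of the resulting corpus as silver-standard rather than gold-standard.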
