This paper presents LinguoNER, a practical and extensible framework for bootstrapping Named Entity Recognition (NER) in extremely low-resource languages, demonstrated on Yambeta, a Bantu language spoken by a minority community in Cameroon. Due to scarce digital resources and the absence of annotated corpora, Yambeta has remained largely underrepresented in Natural Language Processing (NLP). LinguoNER addresses this gap by providing a methodologically transparent end-to-end workflow that integrates corpus acquisition, gazetteer-driven automatic annotation, tokenizer training, transformer fine-tuning, and multi-level evaluation in settings where large-scale manual annotation is infeasible. Using a Bible-derived corpus as a linguistically stable starting point, we release the first publicly available Yambeta NER dataset (≈25,000 tokens) annotated with the CoNLL BIO scheme and a restricted entity schema (PER/LOC/ORG). Because labels are generated via dictionary-based annotation, the corpus is best characterized as silver-standard; credibility is strengthened through recorded dictionaries, transparency logs, expert-in-the-loop validation on sampled subsets, and complementary qualitative error analysis. We additionally train a dedicated Yambeta WordPiece tokenizer that preserves tone markers and diacritics, and fine-tune a bert-base-cased transformer for token classification. On a held-out test split, LinguoNER achieves strong token-level performance (Precision = 0.989, Recall = 0.981, F1 = 0.985), substantially outperforming a dictionary-only gazetteer baseline (ΔF1 ≈ 0.36). Per-entity-type evaluation further indicates improvements beyond surface-form matching, while remaining errors are linguistically motivated and primarily involve multi-word entity boundaries, agglutinative constructions, and tone-/diacritic-sensitive tokenization. 
We emphasize that results are restricted to the Bible domain and a limited label space, and should be interpreted as proof-of-concept evidence rather than as claims of broad out-of-domain generalization. Overall, LinguoNER provides a reproducible blueprint for bootstrapping NER resources in underrepresented languages and supports future work on broader corpus sources (e.g., news, OPUS, JW300), additional African languages (e.g., Yoruba, Igbo, Bassa), and the iterative creation of expert-refined datasets and gold-standard subsets.
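To make the gazetteer-driven annotation and token-level scoring described above more concrete, the following is a minimal sketch and not the released LinguoNER code. It assumes a greedy longest-match lookup over a small dictionary and a simple per-token definition of precision, recall, and F1; the abstract does not specify either. The names `GAZETTEER`, `annotate_bio`, and `token_prf` are hypothetical, and the gazetteer entries are English placeholders rather than Yambeta data.

```python
from typing import Dict, List, Tuple

# Hypothetical gazetteer: token sequences mapped to an entity type (PER/LOC/ORG).
# These entries are illustrative placeholders, not taken from the Yambeta dictionaries.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("Simon", "Peter"): "PER",
    ("Peter",): "PER",
    ("Jerusalem",): "LOC",
}


def annotate_bio(tokens: List[str], gazetteer: Dict[Tuple[str, ...], str]) -> List[str]:
    """Assign CoNLL-style BIO tags by greedy longest-match dictionary lookup."""
    max_len = max((len(key) for key in gazetteer), default=0)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so multi-word entities win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = gazetteer.get(tuple(tokens[i:i + n]))
            if label is not None:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


def token_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Token-level precision, recall, and F1 over non-O tags (assumed definition)."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if p != "O" and p == g)
    fp = sum(1 for g, p in pairs if p != "O" and p != g)
    fn = sum(1 for g, p in pairs if g != "O" and p != g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    sentence = ["Simon", "Peter", "went", "to", "Jerusalem"]
    silver = annotate_bio(sentence, GAZETTEER)
    print(list(zip(sentence, silver)))
    # Score a hypothetical prediction that misses the second token of the name.
    print(token_prf(silver, ["B-PER", "O", "O", "O", "B-LOC"]))
```

Labels produced this way are silver-standard in the sense used above: they are only as good as the dictionary, which is why a fine-tuned model is compared against the dictionary-only baseline with the same metric.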
Tamla et al. studied this question.