May 1, 2026 · Open Access

Annotated Universal Dependencies Dataset for Literary and Educational Uzbek Texts

Abstract

This data article describes an Uzbek Universal Dependencies (UD) treebank released as a manually curated gold-standard dataset. The resource contains 681 sentences (7,542 tokens) drawn from literary and educational Uzbek texts, providing a domain-specific complement to previously available web-based or news-oriented materials [1]. Annotation was carried out in the INCEpTION environment [7] by a five-member team comprising three linguists and two NLP engineers. The workflow followed the UD v2 framework and included calibration-stage agreement assessment, full-corpus double annotation, and adjudication to improve annotation consistency. Agreement measured on the shared calibration material was high across lemmatization, universal part-of-speech annotation, and complete morphological feature–value bundles [9]. The released dataset contains final adjudicated gold-standard annotations, including lemmas, UPOS tags, morphological features, and basic dependency relations in standard CoNLL-U format, and has been validated for compatibility with the Universal Dependencies ecosystem. As an openly reusable Uzbek syntactic resource, it can support the development and evaluation of POS taggers, morphological analyzers, and dependency parsers, while also enabling comparative and cross-lingual studies for low-resource languages [10].
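
To illustrate how the annotation layers named in the abstract (lemmas, UPOS tags, morphological features, basic dependency relations) are laid out in CoNLL-U, the following is a minimal sketch of a reader that uses only the Python standard library. The sample sentence, its annotations, and the names `SAMPLE`, `FIELDS`, and `read_conllu` are illustrative assumptions; nothing here is taken from the released treebank or its tooling.

```python
from collections import Counter

# The ten tab-separated CoNLL-U columns, in their standard order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

# Hypothetical example sentence ("I read a book."); the annotation is
# illustrative only and does NOT come from the released dataset.
SAMPLE = """\
# sent_id = demo-1
# text = Men kitob o'qidim.
1\tMen\tmen\tPRON\t_\tCase=Nom|Number=Sing|Person=1|PronType=Prs\t3\tnsubj\t_\t_
2\tkitob\tkitob\tNOUN\t_\tCase=Nom|Number=Sing\t3\tobj\t_\t_
3\to'qidim\to'qi\tVERB\t_\tNumber=Sing|Person=1|Tense=Past\t0\troot\t_\tSpaceAfter=No
4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_

"""


def read_conllu(text):
    """Return a list of sentences; each sentence is a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line terminates the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        token = dict(zip(FIELDS, cols))
        # Keep only syntactic words: skip multiword ranges such as "1-2"
        # and empty nodes such as "1.1".
        if token["id"].isdigit():
            current.append(token)
    if current:
        sentences.append(current)
    return sentences


sentences = read_conllu(SAMPLE)
n_tokens = sum(len(s) for s in sentences)
upos_counts = Counter(tok["upos"] for s in sentences for tok in s)
roots = [tok["form"] for s in sentences for tok in s if tok["head"] == "0"]
print(len(sentences), n_tokens, dict(upos_counts), roots)
```

Run over the actual release files, the same counts could be compared against the reported 681 sentences and 7,542 tokens, assuming those figures count syntactic words rather than multiword-token ranges.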
