September 27, 2024

Annotation of Analytical Structures in Language Corpora

Abstract

This study analyzes the morphological and word-formation features of analytical structures in corpus markup by reviewing the annotations in Kazakh, other Turkic, and Russian corpora. The findings reveal a high degree of similarity among the corpora of the leading Turkic languages, including Kazakh, Tatar, and Bashkir. Combined and paired words pose no substantial annotation problems, because machines recognize them as single compound units; they can therefore be searched and annotated like other compounds, with morphological, word-formation, and lexical annotations applied to the entire unit. Phrases written with spaces, however, cannot be searched as a single lemma, so each component is annotated individually rather than as part of a unified compound, which limits the functionality of the corpora. By contrast, the National Corpus of the Kazakh Language annotates phraseological units as cohesive wholes, which is a notable advantage of this corpus. Improving the annotation of analytical forms and formants of nouns (degrees), verbs, and auxiliary words in accordance with the structure of the language will enhance corpus functionality. Although this task is complex, an analysis of the current lexical resources in the National Corpus of the Kazakh Language, together with ongoing fundamental research, indicates that the annotation of analytical structures can be gradually automated in the near future.
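
The contrast between token-by-token annotation and unit-level annotation of space-separated phrases can be illustrated with a minimal sketch. It is not the method of any of the corpora discussed; it assumes a small hypothetical lexicon (`MWU_LEXICON`) of multiword units, and the example entry, its lemma, and the tag names are illustrative placeholders only.

```python
from typing import Dict, List, Tuple

# Hypothetical lexicon of space-separated units (assumption, not a real resource).
# Keys are lowercased token sequences; values are unit-level annotations.
MWU_LEXICON: Dict[Tuple[str, ...], Dict[str, str]] = {
    ("келе", "жатыр"): {"lemma": "келу", "tag": "VERB_ANALYTIC"},  # illustrative entry
}


def annotate(tokens: List[str]) -> List[Dict[str, str]]:
    """Greedy longest-match annotation: a token sequence found in the
    lexicon becomes one annotated unit; every other token is annotated
    individually with a placeholder tag."""
    max_len = max((len(key) for key in MWU_LEXICON), default=1)
    result: List[Dict[str, str]] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in MWU_LEXICON:
                entry = MWU_LEXICON[key]
                result.append({
                    "form": " ".join(tokens[i:i + n]),
                    "lemma": entry["lemma"],
                    "tag": entry["tag"],
                })
                i += n
                matched = True
                break
        if not matched:
            # Fallback: single-token annotation with an unknown tag.
            result.append({"form": tokens[i], "lemma": tokens[i].lower(), "tag": "UNK"})
            i += 1
    return result


units = annotate(["Ол", "келе", "жатыр"])
for unit in units:
    print(unit)
```

With a lexicon of this kind, a search by lemma returns the whole analytical form as one hit instead of two unrelated tokens. A real corpus pipeline would additionally need morphological analysis to cover inflected variants of each component.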