What question did this study set out to answer?

The study aims to create KazFakeCorpus, a bilingual resource for automated fake news detection, featuring multi-level semantic annotations.

June 3, 2026 · Open Access

KazFakeCorpus: A Bilingual Corpus with Multi-Level Semantic Annotation for Fake News Detection

Key Points

The study aims to create KazFakeCorpus, a bilingual resource for automated fake news detection, featuring multi-level semantic annotations.
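Constructed a bilingual corpus of 4276 texts in Kazakh and Russian, drawing REAL-class texts from the Gov.kz portal and using synthetically generated messages for the FAKE class.
Conducted annotation in Label Studio with two independent experts, a linguist and a fact-checking specialist; used a pilot study of 120 texts to refine the categories and annotation guidelines.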
Assessed inter-annotator agreement using Krippendorff’s alpha, achieving values of 0.79 to 0.88.
Misattribution (32.5%) was identified as the most frequent disinformation technique, followed by clickbait (23.0%) and emotional pressure (16.4%).
The multi-level annotation scheme effectively represents fake news as a complex phenomenon rather than a binary classification.
The corpus and protocol are valuable for disinformation research and cross-lingual studies.

Abstract

This paper addresses the lack of bilingual annotated resources for automatic fake news detection in the Kazakh–Russian media space, as well as the limitations of binary annotation, which does not always allow disinformation to be represented as a complex and interpretable phenomenon. The aim of the study is to develop KazFakeCorpus and propose a multi-level annotation scheme that captures not only the final veracity of a message, but also the type of fake content, the disinformation technique, communicative intent, modality, and the characteristics of the source and evidence base. The corpus was constructed on the basis of official news materials published on the Gov.kz portal for the REAL class and synthetically generated messages for the FAKE class, complemented by an external validation set of authentic fake news from independent fact-checking sources to assess generalization. After data collection, the texts underwent cleaning, normalization, balancing, and sampling. The final resource includes 4276 texts in Kazakh and Russian, with an average length of approximately 200 words and a balanced distribution across languages and classes. Annotation was carried out in the Label Studio environment by two independent experts: a linguist and a fact-checking specialist. Before the main annotation phase, a pilot study was conducted on a subsample of 120 texts, the results of which were used to refine the categories and prepare the annotation guidelines. Krippendorff’s alpha was used to assess inter-annotator agreement; the obtained values, ranging from 0.79 to 0.88, indicate sufficient stability of the annotation across the key categories. The corpus analysis showed that misattribution (32.5%) is the most frequent disinformation technique, followed by clickbait (23.0%) and emotional pressure (16.4%). 
The results show that the proposed scheme makes it possible to treat fake news not only as a binary class but also as a multi-level semantic object that includes mechanisms of information distortion and features of content presentation. The practical contribution of the study lies in the creation of a bilingual corpus and annotation protocol that can be used in disinformation research, interpretable text analysis, and cross-lingual studies.
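
For readers unfamiliar with the agreement measure mentioned above, the following is a minimal sketch of Krippendorff's alpha for nominal labels, computed from a coincidence matrix. It is not the authors' implementation: the function name krippendorff_alpha_nominal and the toy labels are hypothetical and only illustrate how two annotators' decisions on the same texts turn into a single agreement score.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of lists; each inner list holds the labels that the
    annotators assigned to one text. `None` marks a missing label.
    """
    # Coincidence matrix: every ordered pair of labels given to the same
    # text by different annotators, weighted by 1 / (m - 1), where m is the
    # number of labels available for that text.
    coincidences = Counter()
    for labels in units:
        present = [label for label in labels if label is not None]
        m = len(present)
        if m < 2:
            continue  # a text with fewer than two labels carries no pairing information
        for a, b in permutations(present, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per category and overall number of pairable labels.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("not enough pairable labels")

    # Observed and expected disagreement (both scaled by n, which cancels).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one category is used")
    return 1.0 - observed / expected


# Hypothetical toy data: two annotators labelling four texts.
toy_units = [["FAKE", "FAKE"], ["FAKE", "REAL"], ["REAL", "REAL"], ["REAL", "REAL"]]
toy_alpha = krippendorff_alpha_nominal(toy_units)
```

An alpha of 1 indicates perfect agreement and 0 indicates agreement no better than chance, so the reported range of 0.79 to 0.88 sits in the upper part of the scale.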
