This paper addresses two problems in automatic fake news detection for the Kazakh–Russian media space: the lack of bilingual annotated resources and the limitations of binary annotation, which does not always allow disinformation to be represented as a complex and interpretable phenomenon. The aim of the study is to develop KazFakeCorpus and to propose a multi-level annotation scheme that captures not only the final veracity of a message but also the type of fake content, the disinformation technique, the communicative intent, the modality, and the characteristics of the source and evidence base. The REAL class was built from official news materials published on the Gov.kz portal, and the FAKE class from synthetically generated messages; an external validation set of authentic fake news from independent fact-checking sources was added to assess generalization. After collection, the texts underwent cleaning, normalization, balancing, and sampling. The final resource includes 4276 texts in Kazakh and Russian, with an average length of approximately 200 words and a balanced distribution across languages and classes. Annotation was carried out in Label Studio by two independent experts: a linguist and a fact-checking specialist. Before the main annotation phase, a pilot study was conducted on a subsample of 120 texts, and its results were used to refine the categories and prepare the annotation guidelines. Inter-annotator agreement was assessed with Krippendorff’s alpha; the obtained values, ranging from 0.79 to 0.88, indicate sufficient stability of the annotation across the key categories. The corpus analysis showed that misattribution (32.5%) is the most frequent disinformation technique, followed by clickbait (23.0%) and emotional pressure (16.4%).
The results show that the proposed scheme makes it possible to treat fake news not only as a binary class but also as a multi-level semantic object that includes mechanisms of information distortion and features of content presentation. The practical contribution of the study lies in the creation of a bilingual corpus and annotation protocol that can be used in disinformation research, interpretable text analysis, and cross-lingual studies.
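For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of Krippendorff's alpha for nominal labels, computed from the coincidence matrix. It is an illustration under stated assumptions, not the computation used in the study: the function name `krippendorff_alpha_nominal`, the input layout, and the toy labels are hypothetical, and the nominal distance metric is assumed for every annotation level.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list with one entry per text; each entry lists the labels
    given by the annotators, with None marking a missing label.
    (Hypothetical input layout, chosen for this sketch.)
    """
    # Coincidence matrix: every ordered pair of labels given to the same
    # text by different annotators, weighted by 1 / (m - 1).
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # a text labeled by fewer than two annotators is not pairable
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per category and the total number of pairable labels.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Observed and expected disagreement counts (nominal metric: 0 if equal, 1 otherwise).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        raise ValueError("alpha is undefined when only one category is used")

    return 1.0 - (n - 1) * observed / expected


# Toy example: two annotators, four texts, one disagreement (made-up labels).
toy_units = [["FAKE", "FAKE"], ["REAL", "REAL"], ["FAKE", "REAL"], ["REAL", "REAL"]]
print(round(krippendorff_alpha_nominal(toy_units), 3))
```

With two annotators and no missing labels, as in the setup described above, each text contributes exactly two ordered label pairs. The same function would be applied separately to each annotation level (for example, veracity or disinformation technique) to obtain one alpha value per category.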
Lamasheva et al. (Mon,) studied this question.