This paper examines fake news detection in the Kazakh language in a low-resource setting. A manually annotated corpus was developed to support the analysis. It contains 1106 news stories and social media posts, of which 591 are true and 515 are false; 1081 of the texts are written in Kazakh. Three experimental configurations were examined. In Experiment 1, an English fact-checking model was applied to automatically translated Kazakh articles, and the F1 score decreased by 49.7%, revealing how severely translation distorts meaning. In Experiment 2, we constructed a Kazakh baseline using logistic regression on TF-IDF features alone, which achieved an accuracy of 0.65 and an F1 score of 0.63. This suggests that lexical cues provide some signal for distinguishing real from fake news, but that their overall discriminative power remains limited. The final configuration employed a multilingual contextual transformer (XLM-RoBERTa) trained directly on Kazakh data. It reproduced the reference performance (F1 = 0.97) while yielding a modest improvement in recall. These results demonstrate that multilingual models can process Kazakh effectively without translation, whereas translation introduces significant semantic bias. The dataset and evaluation framework serve as one of the early quantitative benchmarks for fake news detection in Kazakh, offering a basis for further work on cross-lingual disinformation.
Telman et al. (Thu,) studied this question.
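The accuracy and F1 figures quoted in the abstract can be made concrete with a minimal sketch of how such scores are typically computed for a binary true/fake task. This is an illustration under stated assumptions, not the authors' evaluation code: the function names `binary_metrics` and `relative_drop` are hypothetical, the choice of `1` as the positive (fake) class is assumed, the toy labels are invented for the example, and the abstract does not say whether the reported 49.7% decrease is relative or absolute, so the relative form shown here is only one possible reading.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one positive class.

    Assumption: labels are 1 (fake) and 0 (true); the paper does not
    state which class it treats as positive.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def relative_drop(reference, observed):
    """Fractional decrease of `observed` with respect to `reference`."""
    return (reference - observed) / reference


# Toy labels, invented purely for illustration (not from the corpus).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
print(relative_drop(0.8, 0.4))
```

Reporting recall alongside F1, as the abstract does for the XLM-RoBERTa configuration, matters because two models with the same F1 can differ in how many fake items they miss.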