What question did this study set out to answer?

This research aims to enhance fake news detection in the Kazakh language by utilizing multilingual approaches in a resource-limited context.

March 23, 2026 | Open Access

Cross-Lingual and Multilingual Approaches to Fake News Detection in the Kazakh Language

Key Points

Developed a manually annotated corpus of 1106 news stories and social media posts.
Evaluated three experimental configurations for fake news detection.
Applied an English fact-checking model to translated Kazakh articles.
Constructed a Kazakh baseline using TF-IDF features with logistic regression.
Employed a multilingual contextual transformer (XLM-RoBERTa) for direct Kazakh processing.
The English model's F1 score dropped by 49.7% on machine-translated Kazakh articles, a decline attributed to semantic distortion introduced by translation.
The Kazakh baseline achieved an accuracy of 0.65 and an F1 score of 0.63, indicating that lexical cues carry some, but limited, signal for distinguishing real from fake news.
The multilingual model reached an F1 score of 0.97, demonstrating effective processing of Kazakh without translation.
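
To make the TF-IDF plus logistic regression baseline concrete, here is a minimal sketch in plain Python. It is not the study's implementation: the tokenizer, the toy English training texts, the label convention (1 = fake, 0 = real), and the hyperparameters are all assumptions chosen for illustration. In practice a library such as scikit-learn (`TfidfVectorizer`, `LogisticRegression`) would normally be used instead.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase + whitespace split; real Kazakh text needs proper preprocessing.
    return text.lower().split()

def fit_idf(token_docs):
    """Smoothed inverse document frequency for every term seen in training."""
    n = len(token_docs)
    doc_freq = Counter(term for doc in token_docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + df)) + 1.0 for term, df in doc_freq.items()}

def tfidf(tokens, idf):
    """L2-normalized sparse TF-IDF vector; terms unseen in training are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def score(x, w, b):
    """Logistic regression probability that the document is fake (label 1)."""
    z = b + sum(w.get(t, 0.0) * v for t, v in x.items())
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, epochs=200, lr=0.5):
    """Unregularized logistic regression fitted with stochastic gradient descent."""
    w, b = {}, 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            grad = score(x, w, b) - label  # derivative of log loss w.r.t. the logit
            b -= lr * grad
            for t, v in x.items():
                w[t] = w.get(t, 0.0) - lr * grad * v
    return w, b

def predict(text, idf, w, b):
    return int(score(tfidf(tokenize(text), idf), w, b) >= 0.5)

# Hypothetical toy data, NOT drawn from the study's corpus.
texts = [
    "shocking secret cure doctors hide",
    "miracle secret trick shocking results",
    "ministry publishes annual budget report",
    "city council approves budget for schools",
]
labels = [1, 1, 0, 0]  # assumed convention: 1 = fake, 0 = real

docs = [tokenize(t) for t in texts]
idf = fit_idf(docs)
X = [tfidf(d, idf) for d in docs]
w, b = train(X, labels)

print(predict("shocking miracle cure", idf, w, b))
print(predict("council budget report", idf, w, b))
```

Because the features are bag-of-words weights, a model like this can only pick up on which words occur, not on context, which is consistent with the limited discriminative power reported for the lexical baseline.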

Abstract

This paper addresses fake news detection in the Kazakh language in a low-resource setting. A manually annotated corpus was developed to support the analysis. The corpus contains 1106 news stories and social media posts, of which 591 are true and 515 are false; 1081 of the texts are written in Kazakh. Three experimental configurations were examined. In Experiment 1, an English fact-checking model was applied to automatically translated Kazakh articles, and its F1 score decreased by 49.7%, revealing how severe the semantic distortion caused by translation is. In Experiment 2, we constructed a Kazakh baseline using only TF-IDF features with logistic regression, which achieved an accuracy of 0.65 and an F1 score of 0.63. This suggests that lexical cues provide some signal for distinguishing real from fake news, but that their overall discriminative power remains limited. The final configuration employed a multilingual contextual transformer (XLM-RoBERTa) trained directly on Kazakh data. It reproduced the reference performance (F1 = 0.97) while producing a modest improvement in recall. These results demonstrate that multilingual models can process Kazakh effectively without translation, whereas translation introduces significant semantic distortion. The dataset and evaluation framework serve as one of the early quantitative benchmarks for fake news detection in Kazakh, offering a basis for further work on cross-lingual disinformation.
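
The figures in the abstract are easier to interpret with a little arithmetic. The sketch below uses only the corpus counts stated above to compute the class balance and the accuracy of a trivial majority-class predictor, which gives context for the 0.65 baseline accuracy. It also shows two possible readings of a "49.7% decrease" in F1. The starting value of 0.97 used there is an assumption made purely for illustration: the abstract does not state the English model's F1 before translation, nor whether the drop is relative or in percentage points.

```python
# Corpus counts as stated in the abstract.
TOTAL, TRUE_ITEMS, FALSE_ITEMS, KAZAKH_ITEMS = 1106, 591, 515, 1081

def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall for one positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def after_relative_drop(f1_before, drop):
    """Reading 1: the score loses a fraction `drop` of its original value."""
    return f1_before * (1 - drop)

def after_absolute_drop(f1_before, drop):
    """Reading 2: the score loses `drop` as percentage points (0.497 = 49.7 points)."""
    return f1_before - drop

majority_accuracy = max(TRUE_ITEMS, FALSE_ITEMS) / TOTAL
kazakh_share = KAZAKH_ITEMS / TOTAL

print(f"majority-class accuracy: {majority_accuracy:.3f}")  # 0.534
print(f"share of Kazakh texts:   {kazakh_share:.3f}")       # 0.977

# Hypothetical starting F1 of 0.97 (an assumption, see the note above).
print(f"relative reading: {after_relative_drop(0.97, 0.497):.3f}")  # 0.488
print(f"absolute reading: {after_absolute_drop(0.97, 0.497):.3f}")  # 0.473
```

Always predicting the larger class would already yield about 0.53 accuracy on this corpus, so the 0.65 reported for the TF-IDF baseline is a real but modest gain over chance.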
