The relevance of this research is driven by the growing need to improve the efficiency of semantic information retrieval amid the rapid expansion of text data, particularly in low-resource languages such as Kazakh. The purpose of the research is to develop a justified approach for selecting and comparing the text vectorization models used in intelligent search systems, taking into account the morphological and syntactic features of the Kazakh language, and to construct a mathematical model for computing semantic similarity in a multidimensional vector space. The methodology is based on empirical testing of six models (TF-IDF, Word2Vec, FastText, GloVe, BERT, and KazBERT) on a corpus of 24,000 Kazakh texts. Vectorization was performed using CLS tokens; morphological preprocessing employed the Kaznlp tool. Model effectiveness was assessed using precision, recall, and F1-score. The results showed that KazBERT, combined with morphological analysis, achieved the highest accuracy in handling variable word forms, outperforming multilingual BERT by 11–15% and TF-IDF by over 30%. FastText was resilient to morphological variation but less effective on syntactically complex queries. The scientific novelty lies in the development of a hybrid model for intelligent search adapted to the agglutinative nature of the Kazakh language, and in the introduction of a custom morpho-syntactic metric that increases sensitivity to grammatical features. The conclusions confirm that adapting vector models to account for grammar significantly enhances retrieval relevance.
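As an illustration of the kind of computation summarized above, the following minimal sketch ranks documents by similarity to a query in a vector space and scores the retrieved set with precision, recall, and F1. It rests on assumptions not stated in the abstract: cosine similarity stands in for the similarity measure, the toy three-dimensional vectors stand in for real model embeddings, and the function names are hypothetical. It does not reproduce the custom morpho-syntactic metric or the hybrid model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs):
    """Return (document index, score) pairs, most similar first."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data (hypothetical): one query vector and three document vectors.
query = [1.0, 0.0, 1.0]
documents = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]

ranking = rank_documents(query, documents)
top_two = [index for index, _ in ranking[:2]]
metrics = precision_recall_f1(top_two, relevant=[0, 1])
```

In practice the vectors would come from whichever model is being compared (for example TF-IDF weights or transformer embeddings), and the relevance judgments from an annotated query set; only the ranking and scoring steps are sketched here.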
Sadykova et al. studied this question.