Quantization has become a key technique for compressing and accelerating large language models (LLMs). Although low-bit quantization is being actively studied for English-language LLMs, its impact on morphologically rich and resource-diverse languages, including Russian, remains far less explored. The development of high-performance Russian-language and multilingual LLMs makes further research into this problem necessary. We present a systematic study of quantizing pretrained modern Russian-language LLMs to 2.0–4.25 bits per parameter at scales ranging from 4B to 32B parameters. Our experimental setup covers both standard uniform quantization and specialized low-bit formats. Our findings highlight several key trends: (i) the tolerance of Russian-language LLMs to quantization varies across model architectures and sizes; (ii) 4-bit quantization is highly robust, particularly when advanced formats are employed; (iii) 3-bit and 2-bit quantization are the most sensitive to calibration data and scaling strategies. Our empirical results indicate that the choice of quantization technique should take the model’s domain into account.
Poimanov et al. studied this question.
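To make the notions of "bits per parameter" and "standard uniform quantization" concrete, the following minimal sketch shows group-wise round-to-nearest uniform quantization and the resulting average storage cost per weight. It is an illustration under stated assumptions, not the authors' implementation: the function names, the group size of 64, and the 16-bit per-group scale are all hypothetical choices, picked only because they show one way a fractional budget such as 4.25 bits per parameter can arise.

```python
def effective_bits(weight_bits, group_size, scale_bits=16, zero_point_bits=0):
    """Average bits stored per weight when each group shares its metadata.

    Assumption: one scale (and optionally one zero point) per group.
    """
    return weight_bits + (scale_bits + zero_point_bits) / group_size


def quantize_group(values, bits):
    """Asymmetric uniform (round-to-nearest) quantization of one weight group."""
    levels = (1 << bits) - 1                 # highest integer code
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Map each value to the nearest integer code and clamp to the valid range.
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Reconstruct approximate weights from integer codes."""
    return [lo + c * scale for c in codes]


if __name__ == "__main__":
    # Hypothetical weights, chosen only for illustration.
    weights = [-0.8, -0.1, 0.0, 0.3, 0.9, 0.05, -0.45, 0.6]
    for bits in (4, 3, 2):
        codes, scale, lo = quantize_group(weights, bits)
        restored = dequantize_group(codes, scale, lo)
        max_err = max(abs(w - r) for w, r in zip(weights, restored))
        print(f"{bits}-bit: step={scale:.4f}, max abs error={max_err:.4f}")
    print("4-bit weights, group of 64, fp16 scale:", effective_bits(4, 64))
```

With round-to-nearest, the reconstruction error of each weight is bounded by half the quantization step, and the step roughly doubles with every bit removed. This gives some intuition for why 3-bit and 2-bit settings would be more sensitive to how scales are chosen than the 4-bit setting.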