This article presents a comparative evaluation of machine translation quality across several large language models (LLMs), namely DeepSeek, Grok, Mistral, Qwen, GigaChat, and Yandex, based on their translations of expressive linguistic devices (phraseological units, homonyms, puns, etc.) and of texts in various functional styles. Translation quality is assessed quantitatively with automatic reference-based metrics (BLEU, METEOR, and chrF) and qualitatively through expert analysis against reference translations, using the criteria of adequacy, equivalence, and harmony, with Google Translate as an additional point of comparison. The findings demonstrate that modern LLMs can overcome classical machine translation challenges and represent a new paradigm for developing human–AI hybrid systems.
Мыльникова et al. (Sun,) studied this question.