This study evaluates the translation capabilities of GPT-4o, a large language model (LLM), and Google Translate, a neural machine translation (NMT) system, using the American Translators Association (ATA) certification examination framework. We assess translations in two high-resource language pairs: English-to-Chinese (eng-chi) and English-to-Arabic (eng-ara). The evaluation combines automatic scoring with COMET and manual assessment by ATA-certified graders following the standardized ATA grading framework. Two source texts from retired ATA certification exams were translated by both systems into both target languages, producing eight target texts in total. Our findings indicate that performance varies across systems and language pairs, with only GPT-4o’s eng-ara translations achieving superior quality for both required texts. Error analysis reveals distinct patterns between systems and language pairs: GPT-4o’s eng-chi translations primarily exhibit problems with Terminology, Literalness, and Omission, while Google Translate’s eng-chi translations show a different distribution dominated by Cohesion issues, followed by Literalness and Misunderstanding. For eng-ara translations, both systems display similar error patterns, primarily in Terminology and Literalness, suggesting consistent challenges in this language pair. While COMET scores indicate high performance across all translations, manual assessment reveals finer distinctions in translation quality, particularly in the handling of rhetorical expressions and idiomatic language. These findings highlight the importance of complementing automatic metrics with human assessment in translation quality evaluation. The study also suggests that translation challenges extend beyond text complexity, reflecting the distinct linguistic characteristics of each language pair and the different ways in which machine translation (MT) systems handle them.
Zou et al. studied this question.