What type of study is this?

This is a quantitative study.

October 2, 2025

Beyond automated metrics: Assessing GPT-4o and Google Translate against professional translation standards

Key Points

GPT-4o outperformed Google Translate in English-to-Arabic translation, the only system and language pair to achieve superior quality on both required texts.
Error analysis shows distinct error patterns across systems and language pairs, particularly in the Terminology and Literalness categories.
High COMET scores across all translations highlight the limitations of automated metrics in capturing nuanced quality distinctions.
The study underscores the importance of human assessment alongside automatic metrics in evaluating translation performance.

Abstract

This study evaluates the translation capabilities of GPT-4o, a large language model (LLM), and Google Translate, a neural machine translation (NMT) system, using the American Translators Association (ATA) certification examination framework. We assess translations in two high-resource language pairs: English-to-Chinese (eng-chi) and English-to-Arabic (eng-ara). The evaluation combines both automatic metrics using COMET and manual assessment by ATA-certified graders following the standardized ATA grading framework. Two source texts from retired ATA certification exams were translated by both systems, producing eight target texts in total. Our findings indicate varying performance across systems and language pairs, with only GPT-4o’s eng-ara translations achieving superior quality for both required texts. Error analysis reveals distinct patterns between systems and language pairs: GPT-4o’s eng-chi translations primarily exhibit challenges with Terminology, Literalness, and Omission, while Google Translate shows a different distribution dominated by Cohesion issues, followed by Literalness and Misunderstanding. For eng-ara translations, both systems display similar error patterns, primarily in Terminology and Literalness, suggesting consistent challenges in this language pair. While COMET scores indicate high performance across all translations, manual assessment reveals more nuanced distinctions in translation quality, particularly in handling rhetorical expressions and idiomatic language use. These findings highlight the importance of complementing automatic metrics with human assessment in translation quality evaluation. The study also suggests that translation challenges extend beyond text complexity, reflecting distinct linguistic characteristics of each language pair and varying approaches in handling these challenges by different machine translation (MT) systems.
