What type of study is this?

This is a quantitative study.

September 16, 2025

Google Translate or ChatGPT-4? A Multi-Metric Evaluation of Chinese-to-English Technical Translation

Key Points

Human evaluators preferred ChatGPT-4 over GNMT for technical manual translation, suggesting an advantage in perceived quality.
Automatic metrics favored GNMT, the opposite of the human preference, revealing a misalignment between automatic scores and human judgments.
Correlation patterns differed by system: for GNMT, only semantic metrics showed limited correlations with human assessments; for ChatGPT-4, lexical metrics showed moderate to low correlations and semantic metrics showed no meaningful association.
Findings emphasize the need for context-sensitive evaluation methods when assessing LLM-based machine translation.

Abstract

The advent of large language models (LLMs), such as ChatGPT, has opened new avenues for machine translation (MT), particularly in specialised domains such as technical documentation. However, their performance, relative to neural MT systems like Google Neural Machine Translation (GNMT), lacks empirical validation for the Chinese-English language pair. This study aims to compare the Chinese-English translation quality of GNMT and ChatGPT-4 in technical manuals, evaluate the variability of six widely used automatic metrics, and examine their correlation with human assessment. A parallel bilingual corpus of eighty aligned segments from technical manuals was constructed. Translations generated by GNMT and ChatGPT-4 were evaluated using standard automatic lexical metrics (BLEU, METEOR, and CHRF), semantic metrics (BLEURT, BERTScore, and COMET-QE), and human assessments. Statistical analyses employed paired t-tests, Wilcoxon signed-rank tests, Friedman tests with Wilcoxon post hoc comparisons, and Spearman correlations. The results showed that human evaluators preferred ChatGPT-4 over GNMT for technical manual translation, whereas all automatic metrics favoured GNMT. Automatic evaluation revealed notable inconsistencies, with partial alignment observed in COMET-QE-related comparisons. Correlation patterns differed across systems: only semantic metrics exhibited limited correlations with human assessments for GNMT. In contrast, for ChatGPT-4, lexical metrics exhibited moderate to low correlations, whereas semantic metrics demonstrated no meaningful association. These findings highlight ChatGPT-4’s advantage in human-judged translation quality, while also underscoring the misalignment between automatic metrics and human assessments in LLM-based machine translation, thereby reinforcing the need for more context-sensitive and adaptive evaluation approaches.
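As an illustration of the correlation step mentioned in the abstract, the following is a minimal sketch of a Spearman rank correlation between one automatic metric and human ratings. It is not the authors' code: the function names (`average_ranks`, `pearson`, `spearman`) and the five-segment toy scores are assumptions made purely for illustration, and the numbers are not data from the study.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy scores for five segments (NOT data from the study).
metric_scores = [0.31, 0.45, 0.52, 0.60, 0.74]  # e.g. one automatic metric
human_scores = [3, 2, 4, 4, 5]                  # e.g. human adequacy ratings
rho = spearman(metric_scores, human_scores)
print(f"Spearman rho = {rho:.3f}")
```

In a setup like the one described, this calculation would be run once per metric and per system (GNMT and ChatGPT-4) over all eighty segments. A library routine such as `scipy.stats.spearmanr` would normally be preferred in practice, since it also reports a p-value.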
