What question did this study set out to answer?

This evaluation aims to assess the quality of TranslateGemma across multiple languages using MQM criteria.

March 18, 2026 · Open Access

MQM Quality Evaluation of TranslateGemma

Key Points

Conducted an MQM evaluation on TranslateGemma with 45 linguists annotating 322 segments.
Annotated translations covered 16 target languages: 12 officially supported and 4 unsupported.
Measured errors across three severity levels and analyzed inter-annotator agreement.
Found a 6.2-fold quality gap between supported languages (15.3 MQM penalty per 100 words) and unsupported languages (95.7 MQM penalty per 100 words).
Moroccan Arabic, an unsupported language, outperformed 10 of the 12 supported languages.
MetricX-24 XXL showed the highest correlation with human evaluations (r=0.88).
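
The penalty figures above are normalized per 100 words. As a minimal sketch of how such a score can be computed, the snippet below sums severity weights and rescales by word count. The weights (minor 1, major 5, critical 25) are an assumption based on a common MQM convention, not values taken from the study, and the function names are hypothetical.

```python
# Assumed severity weights (a common MQM convention); the study's actual
# weights are not stated in this summary.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 25.0}


def mqm_penalty_per_100_words(error_severities, word_count):
    """Sum the weights of all annotated errors and normalize per 100 words."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    total = sum(SEVERITY_WEIGHTS[s] for s in error_severities)
    return 100.0 * total / word_count


def fold_gap(worse_penalty, better_penalty):
    """Ratio between two penalties (higher penalty means lower quality)."""
    return worse_penalty / better_penalty


# Hypothetical toy segment: two minor errors and one major error in 50 words.
toy_penalty = mqm_penalty_per_100_words(["minor", "minor", "major"], 50)

# Penalties reported in the abstract: unsupported 95.7, supported 15.3.
reported_gap = fold_gap(95.7, 15.3)
```

With the rounded figures from the abstract, the ratio comes out at roughly 6.25, in line with the reported 6.2-fold gap.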

Abstract

We present an MQM (Multidimensional Quality Metrics) evaluation of TranslateGemma, a 12-billion-parameter open-source translation model, across 16 target languages — 12 officially supported and 4 unsupported. The evaluation was designed as a stress test: the source material is a technically dense academic paper in computational linguistics, and the target language set includes low-resource and unsupported languages. Forty-five professional linguists annotated 322 segments, producing 1,169 error annotations across three severity levels. Our findings reveal a 6.2-fold quality gap between supported and unsupported languages (15.3 vs. 95.7 MQM penalty per 100 words). Notably, Moroccan Arabic — an unsupported language — outperformed 10 of 12 supported languages. Inter-annotator agreement analysis shows that while annotators identify similar error quantities (σ=2.24), they disagree substantially on error categories (8% Jaccard overlap) and text spans (16.5% intersection over union), confirming the inherent subjectivity of MQM annotation at the granular level. We additionally benchmark six automatic evaluation metrics against human MQM scores. MetricX-24 XXL achieves the strongest correlation (r=0.88), while COMET-Kiwi XL provides a practical alternative (r=0.84). Infrastructure and model size significantly impact metric reliability.
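
The agreement and correlation statistics named in the abstract can be illustrated with a short sketch. It shows Jaccard overlap on error-category sets, intersection over union on character spans, and Pearson's r between a metric and human scores. How the study paired annotators or aggregated these values is not described here, so the helper names, the half-open span representation, and the example inputs are all assumptions for illustration.

```python
from math import sqrt


def jaccard(categories_a, categories_b):
    """Jaccard overlap between two annotators' sets of error categories."""
    a, b = set(categories_a), set(categories_b)
    union = a | b
    # Two empty sets are treated as full agreement (an assumption).
    return len(a & b) / len(union) if union else 1.0


def span_iou(span_a, span_b):
    """Intersection over union of two half-open character spans (start, end)."""
    (a0, a1), (b0, b1) = span_a, span_b
    inter = max(0, min(a1, b1) - max(a0, b0))
    union = (a1 - a0) + (b1 - b0) - inter
    return inter / union if union > 0 else 0.0


def pearson_r(xs, ys):
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical annotations of the same segment by two annotators.
cat_overlap = jaccard({"accuracy/mistranslation", "fluency/grammar"},
                      {"fluency/grammar", "terminology"})  # 1 shared of 3
iou = span_iou((10, 20), (15, 30))  # 5 shared characters out of 20 covered
r = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # perfectly linear toy data
```

Low values on the first two measures alongside similar error counts would match the pattern the abstract describes: annotators agree on how much is wrong but not on what or where.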


Cite This Study

Alexander Murauski studied this question.

synapsesocial.com/papers/69ba44154e9516ffd37a5edd https://doi.org/10.5281/zenodo.19046449
