We present an MQM (Multidimensional Quality Metrics) evaluation of TranslateGemma, a 12-billion-parameter open-source translation model, across 16 target languages — 12 officially supported and 4 unsupported. The evaluation was designed as a stress test: the source material is a technically dense academic paper in computational linguistics, and the target language set includes low-resource and unsupported languages. Forty-five professional linguists annotated 322 segments, producing 1,169 error annotations across three severity levels. Our findings reveal a 6.2-fold quality gap between supported and unsupported languages (15.3 vs. 95.7 MQM penalty per 100 words). Notably, Moroccan Arabic — an unsupported language — outperformed 10 of 12 supported languages. Inter-annotator agreement analysis shows that while annotators identify similar error quantities (σ=2.24), they disagree substantially on error categories (8% Jaccard overlap) and text spans (16.5% intersection over union), confirming the inherent subjectivity of MQM annotation at the granular level. We additionally benchmark six automatic evaluation metrics against human MQM scores. MetricX-24 XXL achieves the strongest correlation (r=0.88), while COMET-Kiwi XL provides a practical alternative (r=0.84). Infrastructure and model size significantly impact metric reliability.
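The quantities reported above (MQM penalty per 100 words, Jaccard overlap of error categories, and intersection over union of text spans) can be illustrated with a minimal sketch. The severity weights below (minor = 1, major = 5, critical = 10) are an assumption, since the abstract does not state the weights actually used. The function names and the half-open character-offset representation of spans are likewise hypothetical. How pairwise values were aggregated across annotators is also not specified here.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumed severity weights for the three levels; the abstract does not give them.
SEVERITY_WEIGHTS: Dict[str, float] = {"minor": 1.0, "major": 5.0, "critical": 10.0}


def mqm_penalty_per_100_words(
    severities: Iterable[str],
    word_count: int,
    weights: Dict[str, float] = SEVERITY_WEIGHTS,
) -> float:
    """Sum the weighted error penalties and normalize to 100 words."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    total = sum(weights[s] for s in severities)
    return 100.0 * total / word_count


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two annotators' error-category sets."""
    if not a and not b:
        return 1.0  # two empty sets are treated as full agreement
    return len(a & b) / len(a | b)


def span_iou(span_a: Tuple[int, int], span_b: Tuple[int, int]) -> float:
    """Intersection over union of two half-open character spans [start, end)."""
    start_a, end_a = span_a
    start_b, end_b = span_b
    intersection = max(0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # One minor and one major error in a 40-word segment.
    print(mqm_penalty_per_100_words(["minor", "major"], 40))
    # Two annotators agree on one of three distinct categories.
    print(jaccard({"accuracy", "fluency"}, {"fluency", "terminology"}))
    # Two spans of length 10 that overlap by 5 characters.
    print(span_iou((0, 10), (5, 15)))
```

Under these definitions, two annotators can report the same number of errors for a segment while still scoring low on both overlap measures, which is the pattern the agreement analysis describes.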
Alexander Murauski studied this question.