Fine-grained word-level Quality Estimation (QE), such as that based on Multidimensional Quality Metrics (MQM), provides error annotations that can enhance both Automatic Post-Editing (APE) and APE evaluation. We study a two-stage QE-assisted APE pipeline in which a QE model tags error spans and a post-editor refines the translation conditioned on these annotations. We improve QE decoding using Minimum Bayes Risk (MBR) decoding. In addition, we introduce Edit Agreement Test-F1, a novel metric that measures over- or undercorrection by comparing QE predictions against gold annotations. We expand two MQM datasets with post-edited translations generated using GPT-4. Experiments in four translation directions show that accurate word-level QE improves translation quality and that our MBR-enhanced QE models outperform state-of-the-art baselines.
Lin et al. studied this question.
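To make the idea of MBR decoding for word-level QE concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the QE model can produce several candidate tag sequences for the same translation (1 = error token, 0 = correct token) and that the utility function is F1 over error-tagged positions; both choices, along with the helper names `tag_f1` and `mbr_select`, are hypothetical and used only for illustration. MBR then selects the candidate that agrees most, on average, with the other candidates.

```python
from typing import List, Sequence


def tag_f1(hyp: Sequence[int], ref: Sequence[int]) -> float:
    """F1 over positions tagged as errors (1 = error, 0 = ok).

    Assumed utility function; the source does not specify which one is used.
    """
    true_pos = sum(1 for h, r in zip(hyp, ref) if h == 1 and r == 1)
    hyp_pos = sum(hyp)
    ref_pos = sum(ref)
    if hyp_pos == 0 and ref_pos == 0:
        return 1.0  # both sequences agree that there are no errors
    if true_pos == 0:
        return 0.0
    precision = true_pos / hyp_pos
    recall = true_pos / ref_pos
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates: List[Sequence[int]]) -> int:
    """Return the index of the candidate with the highest average utility
    measured against all other candidates (ties go to the earliest one)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    best_idx, best_score = 0, -1.0
    for i, hyp in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        # With a single candidate there is nothing to compare against.
        score = sum(tag_f1(hyp, o) for o in others) / len(others) if others else 0.0
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx


# Hypothetical candidate tag sequences for a four-token translation.
candidates = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
]
chosen = candidates[mbr_select(candidates)]
```

In this toy example the outlier candidate that tags only the first token receives zero expected utility, so the selection falls on the tag sequence that most of the other candidates support.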