January 1, 2014 | Open Access

Testing for Significance of Increased Correlation with Human Judgment

Abstract

Automatic metrics are widely used in machine translation as a substitute for human assessment. With the introduction of any new metric comes the question of just how well that metric mimics human assessment of translation quality. This is often measured by correlation with human judgment. However, significance tests are generally not used to establish whether improvements over existing methods such as BLEU are statistically significant or have occurred simply by chance. In this paper, we introduce a significance test for comparing correlations of two metrics, along with an open-source implementation of the test. When applied to a range of metrics across seven language pairs, tests show that for a high proportion of metrics, there is insufficient evidence to conclude significant improvement over BLEU.
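
The abstract does not say which significance test is used, so the following is only an illustrative sketch and not the paper's released implementation. It assumes Williams' test, a standard test for the difference between two dependent correlations: two metrics are each correlated with the same human scores, and the metrics are also correlated with each other. The names `williams_test`, `t_sf`, and `pearson` are hypothetical, and the p-value is obtained by numerically integrating the Student t density so that the sketch needs only the standard library.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))


def t_sf(t, df, steps=2000):
    """Upper-tail probability P(T > t), via Simpson's rule on [0, |t|]."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    upper = max(0.0, 0.5 - total * h / 3)
    return upper if t >= 0 else 1.0 - upper


def williams_test(r12, r13, r23, n):
    """One-sided Williams test that r12 exceeds r13.

    r12: correlation of human scores with the new metric
    r13: correlation of human scores with the baseline metric (e.g. BLEU)
    r23: correlation between the two metrics
    n:   number of items (e.g. systems) the correlations are computed over
    Returns (t statistic, one-sided p-value) with n - 3 degrees of freedom.
    """
    if n <= 3:
        raise ValueError("Williams' test needs n > 3")
    k = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23
    denom = math.sqrt(2 * k * (n - 1) / (n - 3)
                      + ((r12 + r13) ** 2 / 4) * (1 - r23) ** 3)
    t = (r12 - r13) * math.sqrt((n - 1) * (1 + r23)) / denom
    return t, t_sf(t, n - 3)
```

Under these assumptions, a call such as `williams_test(r12, r13, r23, n)` with correlations computed by `pearson` over the same `n` items returns a t statistic and a one-sided p-value. A small p-value suggests that the new metric's higher correlation with human judgment is unlikely to be due to chance. Because the test accounts for the correlation between the two metrics, a gap between `r12` and `r13` that looks sizeable can still fail to reach significance when `n` is small.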
