Randomized methods of significance testing enable estimation of the probability that an increase in score has occurred simply by chance. In this paper, we examine the accuracy of three randomized methods of significance testing in the context of machine translation: paired bootstrap resampling, bootstrap resampling and approximate randomization. We carry out a large-scale human evaluation of shared task systems for two language pairs to provide a gold standard for tests. Results show very little difference in accuracy across the three methods of significance testing. Notably, the accuracy of all test/metric combinations for evaluation of English-to-Spanish is so low that there is not enough evidence to conclude they are any better than a random coin toss.
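As a rough illustration of two of the methods named above, the following minimal sketch shows approximate randomization and paired bootstrap resampling for a pair of systems. It is not the authors' implementation. It assumes, for simplicity, that the metric is the mean of per-segment scores (corpus-level metrics such as BLEU would need to be recomputed on each resampled or shuffled test set), and the function names, trial counts and seeds are hypothetical choices made for this example.

```python
import random
from statistics import mean


def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided p-value estimate for the difference in mean score.

    Assumption: scores_a[i] and scores_b[i] are scores of the two
    systems on the same test segment i.
    """
    rng = random.Random(seed)
    observed = abs(mean(scores_a) - mean(scores_b))
    at_least_as_extreme = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the system labels are exchangeable,
            # so swap each pair of scores with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        if abs(mean(shuffled_a) - mean(shuffled_b)) >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (trials + 1)


def paired_bootstrap(scores_a, scores_b, trials=10000, seed=0):
    """Fraction of bootstrap test sets on which system A beats system B."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(trials):
        # Resample segment indices with replacement; both systems are
        # scored on the same resampled segments (hence "paired").
        idx = [rng.randrange(n) for _ in range(n)]
        if mean(scores_a[i] for i in idx) > mean(scores_b[i] for i in idx):
            wins += 1
    return wins / trials
```

Under these assumptions, a small value returned by `approximate_randomization` suggests the observed difference is unlikely to be due to chance, while a value near 1 from `paired_bootstrap` (for example, at least 0.95) suggests system A is reliably better than system B on resampled test sets.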
Graham et al. studied this question.