What does this research mean for the field?

Currently recommended ROUGE variants for evaluating summarization systems are suboptimal, and the machine translation metric BLEU performs on par with ROUGE when evaluated using a novel, statistically rigorous methodology. This is a novel finding that challenges the existing consensus.

January 1, 2015, Open Access

Re-evaluating Automatic Summarization with BLEU and 192 Shades of ROUGE

Abstract

We provide an analysis of current evaluation methodologies applied to summarization metrics and identify the following areas of concern: (1) movement away from evaluation by correlation with human assessment; (2) omission of important components of human assessment from evaluations, in addition to large numbers of metric variants; (3) absence of methods of significance testing improvements over a baseline. We outline an evaluation methodology that overcomes all such challenges, providing the first method of significance testing suitable for evaluation of summarization metrics. Our evaluation reveals for the first time which metric variants significantly outperform others, optimal metric variants distinct from current recommended best variants, as well as machine translation metric BLEU to have performance on-par with ROUGE for the purpose of evaluation of summarization systems. We subsequently replicate a recent large-scale evaluation that relied on, what we now know to be, suboptimal ROUGE variants revealing distinct conclusions about the relative performance of state-of-the-art summarization systems.
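The idea of evaluating metrics by correlation with human assessment, and then testing whether one metric significantly outperforms another, can be sketched as follows. The abstract does not name the significance test, so this sketch assumes the Williams test, a common choice for comparing two correlations that share a variable (here, the human scores). The function names `pearson` and `williams_t` and the toy scores are hypothetical and for illustration only; they are not taken from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def williams_t(human, metric_a, metric_b):
    """Williams test statistic for the difference between two dependent
    correlations: corr(human, metric_a) versus corr(human, metric_b).

    Returns (t, degrees_of_freedom). A large positive t suggests that
    metric_a correlates more strongly with human scores than metric_b.
    Assumption: this is one standard test, not necessarily the paper's.
    """
    n = len(human)
    r12 = pearson(human, metric_a)     # human vs. metric A
    r13 = pearson(human, metric_b)     # human vs. metric B
    r23 = pearson(metric_a, metric_b)  # metric A vs. metric B
    # Determinant of the 3x3 correlation matrix.
    k = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23
    numerator = (r12 - r13) * math.sqrt((n - 1) * (1 + r23))
    denominator = math.sqrt(
        2 * k * (n - 1) / (n - 3) + ((r12 + r13) ** 2 / 4) * (1 - r23) ** 3
    )
    return numerator / denominator, n - 3


# Toy system-level scores (made up for illustration).
human = [1, 2, 3, 4, 5, 6]
metric_a = [1.1, 2.0, 2.9, 4.2, 5.1, 5.8]
metric_b = [2, 1, 4, 3, 6, 3]

t_stat, dof = williams_t(human, metric_a, metric_b)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

The statistic would then be compared against a Student's t distribution with the returned degrees of freedom to obtain a p-value; that step is left out here to keep the sketch free of external dependencies.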
