In DUC 2005, the pyramid method for content evaluation was used for the first time in a cross-site evaluation. We discuss the method used in creating pyramid models and performing peer annotation. Analysis of score averages for the peers indicates that the best systems score half as well as humans, and that systems can be grouped into better and worse performers. There were few significant differences among systems. High score correlations between sets from different annotators, and good interannotator agreement, indicate that participants can perform annotation reliably. We found that a modified pyramid score gave good results and would simplify peer annotation in the future.
McKeown et al. studied this question.