Previous work has shown that human evaluations in NLP are notoriously under-powered.
Howcroft et al. studied this question.