We are surprised to find that BERT's peak performance of 77% on the Argument Reasoning Comprehension Task reaches just three points below the average untrained human baseline. However, we show that this result is entirely accounted for by exploitation of spurious statistical cues in the dataset. We analyze the nature of these cues and demonstrate that a range of models all exploit them. This analysis informs the construction of an adversarial dataset on which all models achieve random accuracy. Our adversarial dataset provides a more robust assessment of argument comprehension and should be adopted as the standard in future work.
Niven et al. studied this question.
www.synapsesocial.com/papers/6a079f953d01ce3fbe8b3de3 — DOI: https://doi.org/10.18653/v1/p19-1459
Timothy Niven
Hung‐Yu Kao
National Cheng Kung University