January 1, 2019 · Open Access

Human vs. Muppet: A Conservative Estimate of Human Performance on the GLUE Benchmark

Abstract

The GLUE benchmark (Wang et al., 2019b) is a suite of language understanding tasks which has seen dramatic progress in the past year, with average performance moving from 70.0 at launch to 83.9, state of the art at the time of writing (May 24, 2019). Here, we measure human performance on the benchmark, in order to learn whether significant headroom remains for further progress. We provide a conservative estimate of human performance on the benchmark through crowdsourcing: Our annotators are non-experts who must learn each task from a brief set of instructions and 20 examples. In spite of limited training, these annotators robustly outperform the state of the art on six of the nine GLUE tasks and achieve an average score of 87.1. Given the fast pace of progress, however, the headroom we observe is quite limited. To reproduce the data-poor setting that our annotators must learn in, we also train the BERT model

Cite This Study

Nangia et al. (2019) studied this question.

synapsesocial.com/papers/6a0898c9ad370a6b44de3564 https://doi.org/10.18653/v1/p19-1449
