In natural language processing (NLP) annotation projects, we use inter-annotator agreement measures and annotation guidelines to ensure consistent annotations. However, annotation guidelines often make linguistically debatable and even somewhat arbitrary decisions, and inter-annotator agreement is often less than perfect. While annotation projects usually specify how to deal with linguistically debatable phenomena, annotator disagreements typically still stem from these “hard” cases. This indicates that some errors are more debatable than others. In this paper, we use small samples of doubly-annotated part-of-speech (POS) data for Twitter to estimate annotation reliability and show how those metrics of likely inter-annotator agreement can be implemented in the loss functions of POS taggers. We find that these cost-sensitive algorithms perform better across annotation projects and, more surprisingly, even on data annotated according to the same guidelines. Finally, we show that POS tagging models sensitive to inter-annotator agreement perform better on the downstream task of chunking.
Plank et al. studied this question.
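The idea of folding inter-annotator agreement into a tagger's loss function can be illustrated with a minimal sketch. It assumes a perceptron-style learner and one particular scaling (one minus the estimated confusion probability of the two tags). Neither choice is confirmed by the abstract, and the function names `confusion_probabilities`, `loss_factor`, and `perceptron_update` are hypothetical.

```python
from collections import Counter, defaultdict


def confusion_probabilities(annotations_a, annotations_b):
    """Estimate how often two annotators confuse each unordered tag pair.

    Both arguments are equal-length lists of tags for the same tokens.
    Returns a dict mapping frozenset({t1, t2}) to the fraction of tokens
    on which the annotators chose exactly those two different tags.
    """
    if len(annotations_a) != len(annotations_b):
        raise ValueError("annotations must cover the same tokens")
    counts = Counter(
        frozenset((a, b))
        for a, b in zip(annotations_a, annotations_b)
        if a != b
    )
    total = len(annotations_a)
    return {pair: n / total for pair, n in counts.items()}


def loss_factor(gold, predicted, confusion):
    """Cost of predicting `predicted` when the reference tag is `gold`.

    Assumed scaling (not taken from the source): 0 for a correct
    prediction, otherwise 1 minus the estimated confusion probability,
    so mistakes that human annotators also make are penalized less.
    """
    if gold == predicted:
        return 0.0
    return 1.0 - confusion.get(frozenset((gold, predicted)), 0.0)


def perceptron_update(weights, features, gold, predicted, confusion, rate=1.0):
    """Cost-sensitive perceptron step: scale the update by the loss factor."""
    step = rate * loss_factor(gold, predicted, confusion)
    if step == 0.0:
        return  # correct prediction, nothing to update
    for feature in features:
        weights[(feature, gold)] += step
        weights[(feature, predicted)] -= step


# Toy doubly-annotated sample: the annotators disagree on one of four tokens.
annotator_a = ["NOUN", "VERB", "ADJ", "NOUN"]
annotator_b = ["NOUN", "VERB", "NOUN", "NOUN"]
confusion = confusion_probabilities(annotator_a, annotator_b)

weights = defaultdict(float)
perceptron_update(weights, ["word=lol"], "ADJ", "NOUN", confusion)
```

In this toy sample, an ADJ/NOUN mistake gets a smaller update than a mistake between tags the annotators never confused, such as VERB/NOUN. A real setup would estimate the confusion table from a larger doubly-annotated sample and might use a different agreement statistic.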