We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent works in response generation have adopted metrics from machine translation to compare a model's generated response to a single target response. We show that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.
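The comparison described above, scoring a generated response against a single target response and then checking agreement with human judgements, can be sketched roughly as follows. This is a minimal illustration and not the authors' evaluation code. The clipped unigram precision is a simplified stand-in for machine-translation word-overlap metrics (BLEU is a well-known example). Spearman rank correlation is an assumed choice of correlation measure. All function names are hypothetical.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision of a candidate against ONE reference.

    Simplified stand-in for MT-style word-overlap metrics; it is not
    full BLEU (no higher-order n-grams, no brevity penalty).
    """
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # Each candidate token is credited at most as often as it occurs
    # in the reference (the "clipping" step).
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return overlap / len(cand_tokens)


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(metric_scores, human_scores):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))
```

With a single target response, a reasonable reply that happens to share no words with the target scores zero under this kind of metric, which is one way such scores can diverge from human ratings. Given one metric score and one human rating per response, `spearman(metric_scores, human_scores)` summarizes how well the two orderings agree.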
Chia‐Wei Liu
Ryan Lowe
Iulian Vlad Serban
Liu et al. studied this question.
www.synapsesocial.com/papers/6a0a55e38e4d6c8168574170 — DOI: https://doi.org/10.48550/arxiv.1603.08023