In the evolving landscape of clinical Natural Language Generation (NLG), assessing the quality of abstractive text remains challenging because existing methods often overlook the complexities of generative tasks. This work examined the current state of automated evaluation metrics for NLG in healthcare. To obtain a robust, well-validated baseline against which to measure the alignment of these metrics, we created a comprehensive human evaluation framework. Using generative output from ChatGPT-3.5-turbo, we correlated human judgments with each metric. None of the metrics demonstrated high alignment; however, the SapBERT score, a metric based on the Unified Medical Language System (UMLS), showed the best results. This underscores the importance of incorporating domain-specific knowledge into evaluation efforts. Our work reveals the deficiency of current quality evaluations for generated text and introduces our comprehensive human evaluation framework as a baseline. Future efforts should prioritize integrating medical knowledge databases to improve the alignment of automated metrics, with particular focus on refining the SapBERT score.
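The alignment step described above, correlating human judgments with each automated metric, can be illustrated with a minimal sketch. The abstract does not say which correlation statistic was used; Spearman rank correlation is assumed here purely for illustration. The metric names (`metric_a`, `metric_b`) and all scores are hypothetical and are not data from the study.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical example: one human rating per generated text (assumed 1-5 scale)
# and one score per text from each automated metric.
human_scores = [4, 2, 5, 3, 1]
metric_scores = {
    "metric_a": [0.81, 0.40, 0.77, 0.65, 0.30],
    "metric_b": [0.20, 0.90, 0.10, 0.50, 0.70],
}

# Alignment of each metric with the human baseline.
alignment = {
    name: spearman(human_scores, scores)
    for name, scores in metric_scores.items()
}

for name, rho in sorted(alignment.items(), key=lambda kv: -kv[1]):
    print(f"{name}: rho = {rho:.2f}")
```

A rank-based statistic is a common choice when human ratings are ordinal, since it only assumes that higher metric scores should accompany higher human ratings, not that the relationship is linear.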
Emma Croxford
University of Wisconsin–Madison
Yanjun Gao
University of Colorado Anschutz Medical Campus
Brian W. Patterson
University of Wisconsin System
University of Wisconsin–Madison
Loyola University Chicago
Croxford et al. studied this question.
synapsesocial.com/papers/68e73082b6db6435876a9b0b — DOI: https://doi.org/10.1101/2024.03.20.24304620