August 19, 2024 | Open Access

Ranking Generated Answers: On the Agreement of Retrieval Models with Humans on Consumer Health Questions

Abstract

Evaluating the output of generative large language models (LLMs) is challenging and difficult to scale. Most evaluations of LLMs focus on tasks such as single-choice question-answering or text classification. These tasks are not suitable for assessing open-ended question-answering capabilities, which are critical in domains where expertise is required, such as health, and where misleading or incorrect answers can have a significant impact on a user's health. Using human experts to evaluate the quality of LLM answers is generally considered the gold standard, but expert annotation is costly and slow. We present a method for evaluating LLM answers that uses ranking signals as a substitute for explicit relevance judgements. Our scoring method correlates with the preferences of human experts. We validate it by investigating the well-known fact that the quality of generated answers improves with the size of the model as well as with more sophisticated prompting strategies.
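
As a rough illustration of what agreement between a retrieval model's ranking and human preferences can look like, the sketch below computes Kendall's tau between two sets of scores for the same answers. This is not the paper's scoring method, which the abstract does not specify; the function name `kendall_tau`, the scores, and the ratings are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same items.

    Returns a value in [-1, 1]: 1 means both lists order every pair of
    items the same way, -1 means they order every pair in opposite ways.
    Tied pairs count as neither concordant nor discordant.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both score lists must cover the same items")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        # Positive product: both lists agree on which item ranks higher.
        direction = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1

    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0


# Hypothetical data: four generated answers to one health question.
retrieval_scores = [0.91, 0.40, 0.75, 0.10]  # assumed ranking-model scores
human_ratings = [5, 3, 2, 1]                 # assumed expert ratings

tau = kendall_tau(retrieval_scores, human_ratings)
print(f"Kendall's tau: {tau:.3f}")
```

A value near 1 would suggest that the ranking signal orders answers much as the experts do, which is the kind of agreement the abstract refers to. A real evaluation would aggregate over many questions and account for ties.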

Cite This Study

Heineking et al. (2024) studied this question.

synapsesocial.com/papers/68e5bd35b6db643587554ea8 https://doi.org/10.48550/arxiv.2408.09831
