April 25, 2024 · Open Access

Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks

Abstract

Benchmarks have emerged as the central approach for evaluating Large Language Models (LLMs). The research community often relies on a model's average performance across the test prompts of a benchmark to evaluate the model's performance. This is consistent with the assumption that the test prompts within a benchmark represent a random sample from a real-world distribution of interest. We note that this is generally not the case; instead, we hold that the distribution of interest varies according to the specific use case. We find that (1) the correlation in model performance across test prompts is non-random, (2) accounting for correlations across test prompts can change model rankings on major benchmarks, and (3) explanatory factors for these correlations include semantic similarity and common LLM failure points.
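To make finding (2) concrete, the following is a minimal sketch, not the authors' method, of how moving away from a uniform average over test prompts can reorder models. The correctness matrix, the model names, and the grouping of prompts into clusters are all made-up assumptions for illustration; down-weighting each prompt by its cluster size is just one simple way to encode a different distribution of interest.

```python
import numpy as np

# Hypothetical 0/1 correctness matrix (made-up data, not from the paper):
# rows are models, columns are test prompts.
model_names = ["model_a", "model_b", "model_c"]
scores = np.array([
    [1, 1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
], dtype=float)

# Correlation between prompts, computed across models
# (rowvar=False treats each column, i.e. each prompt, as a variable).
prompt_corr = np.corrcoef(scores, rowvar=False)

# Assumed grouping: prompts 0-2 are treated as near-duplicates of one another,
# while prompts 3, 4 and 5 each form their own cluster.
cluster_ids = np.array([0, 0, 0, 1, 2, 3])


def rank_models(weights):
    """Return model names ordered by weighted mean score, plus the scores."""
    weights = np.asarray(weights, dtype=float)
    weighted = scores @ (weights / weights.sum())
    order = np.argsort(-weighted)  # best model first
    return [model_names[i] for i in order], weighted


# Uniform weighting: the usual benchmark average.
uniform_ranking, uniform_scores = rank_models(np.ones(scores.shape[1]))

# Cluster-aware weighting: each prompt counts 1 / (size of its cluster),
# so a group of related prompts contributes as much as a single prompt.
cluster_sizes = np.bincount(cluster_ids)[cluster_ids]
cluster_ranking, cluster_scores = rank_models(1.0 / cluster_sizes)

print("uniform:      ", uniform_ranking, uniform_scores.round(3))
print("cluster-aware:", cluster_ranking, cluster_scores.round(3))
```

In this toy setup, model_a leads under the uniform average because it answers all three clustered prompts correctly, but model_b moves ahead once those prompts are counted as a single unit. The data are constructed to show the effect, so the example says nothing about how often or how strongly this happens on real benchmarks.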

Ailem et al. (2024) studied this question.

synapsesocial.com/papers/68e6dac2b6db643587657730 https://doi.org/10.48550/arxiv.2404.16966
