April 22, 2026 | Open Access

Evaluating large language models for accuracy incentivizes hallucinations

Abstract

Large language models sometimes produce confident, plausible falsehoods ('hallucinations'), limiting their reliability [1,2]. Previous work has offered numerous explanations and effective mitigations such as retrieval and tool use [3], consistency-based self-verification [4] and reinforcement learning from human feedback [5]. Nonetheless, the problem persists even in state-of-the-art language models [6,7]. Here we show how next-word prediction and accuracy-based evaluations inadvertently reward unwarranted guessing. Initially, next-word pretraining creates statistical pressure towards hallucination even with idealized error-free data: using learning theory [8,9], we show that facts lacking repeated support in training data (such as one-off details) yield unavoidable errors, whereas recurring regularities (such as grammar) do not. Subsequent training stages aim to correct such errors. However, dominant headline metrics such as accuracy systematically reward guessing over admitting uncertainty. To align incentives, we suggest two additions to the classic approach of adding error penalties to evaluations to control abstention [10,11]. First, we propose 'open rubric' evaluations that explicitly state how errors are penalized (if at all), which test whether a model modulates its abstentions to stated stakes while optimizing accuracy. Second, as hallucination-specific benchmarks rarely make leaderboards [12], we suggest using open-rubric variants of existing evaluations, to reverse their guessing incentives. Reframing hallucination as an incentive problem opens a practical path towards more reliable language models.
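
The incentive argument can be illustrated with a small expected-score calculation. The sketch below is an illustration only, not the paper's rubric: it assumes a hypothetical scoring scheme of +1 for a correct answer, 0 for abstaining, and a penalty of `wrong_penalty` points for a wrong answer. The function names are likewise invented for this example.

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score of answering under the assumed scheme:
    +1 if correct, -wrong_penalty if wrong (abstaining scores 0)."""
    return p_correct - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct: float, wrong_penalty: float) -> bool:
    """Answer only when answering beats the abstention score of 0."""
    return expected_score(p_correct, wrong_penalty) > 0.0


def confidence_threshold(wrong_penalty: float) -> float:
    """Confidence above which answering has positive expected score.
    Solving p - (1 - p) * penalty > 0 gives p > penalty / (1 + penalty)."""
    return wrong_penalty / (1.0 + wrong_penalty)


if __name__ == "__main__":
    # Plain accuracy corresponds to wrong_penalty = 0: any nonzero chance
    # of being right makes guessing better than abstaining.
    for penalty in (0.0, 1.0, 3.0):
        print(penalty, confidence_threshold(penalty), should_answer(0.1, penalty))
```

Under these assumptions, plain accuracy (a penalty of 0) puts the threshold at 0, so a model with 10% confidence is still better off guessing. Once the stated penalty is positive, the same low-confidence guess has negative expected score and abstaining becomes the better choice.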
