Abstract

Large language models sometimes produce confident, plausible falsehoods (‘hallucinations’), limiting their reliability [1,2]. Previous work has offered numerous explanations and effective mitigations such as retrieval and tool use [3], consistency-based self-verification [4] and reinforcement learning from human feedback [5]. Nonetheless, the problem persists even in state-of-the-art language models [6,7]. Here we show how next-word prediction and accuracy-based evaluations inadvertently reward unwarranted guessing. Initially, next-word pretraining creates statistical pressure towards hallucination even with idealized error-free data: using learning theory [8,9], we show that facts lacking repeated support in training data (such as one-off details) yield unavoidable errors, whereas recurring regularities (such as grammar) do not. Subsequent training stages aim to correct such errors. However, dominant headline metrics such as accuracy systematically reward guessing over admitting uncertainty. To align incentives, we suggest two additions to the classic approach of adding error penalties to evaluations to control abstention [10,11]. First, we propose ‘open rubric’ evaluations that explicitly state how errors are penalized (if at all), which test whether a model modulates its abstentions to stated stakes while optimizing accuracy. Second, as hallucination-specific benchmarks rarely make leaderboards [12], we suggest using open-rubric variants of existing evaluations, to reverse their guessing incentives. Reframing hallucination as an incentive problem opens a practical path towards more reliable language models.
Kalai et al. studied this question.
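The incentive argument in the abstract can be illustrated with a small scoring sketch. It assumes a simple rubric that is not specified in the abstract itself: a correct answer earns 1 point, a wrong answer loses `penalty` points, and abstaining earns 0. The function names (`expected_score`, `confidence_threshold`, `should_answer`) are hypothetical and chosen only for this illustration. Under these assumptions, answering beats abstaining exactly when the model's probability of being correct exceeds `penalty / (1 + penalty)`.

```python
def expected_score(p_correct: float, penalty: float) -> float:
    """Expected score of answering under the assumed rubric:
    +1 if correct, -penalty if wrong (abstaining scores 0)."""
    return p_correct * 1.0 - (1.0 - p_correct) * penalty


def confidence_threshold(penalty: float) -> float:
    """Confidence above which answering has a higher expected score
    than abstaining, obtained by solving p - (1 - p) * penalty > 0."""
    return penalty / (1.0 + penalty)


def should_answer(p_correct: float, penalty: float) -> bool:
    """True when answering is strictly better than abstaining in expectation."""
    return expected_score(p_correct, penalty) > 0.0


if __name__ == "__main__":
    # Plain accuracy corresponds to penalty = 0: even a 10% guess
    # has positive expected score, so guessing is always rewarded.
    print(should_answer(0.10, penalty=0.0))

    # With a stated penalty of 3 points per error, the same rubric
    # only rewards answering above 75% confidence.
    print(confidence_threshold(3.0))
    print(should_answer(0.50, penalty=3.0))
```

With `penalty = 0` the threshold is 0, which is the sense in which accuracy-only grading rewards guessing over admitting uncertainty. Stating the penalty in the prompt, as an open-rubric evaluation would, lets a model pick its abstention behaviour to match the stated stakes.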