May 16, 2024 · Open Access

Language Models can Evaluate Themselves via Probability Discrepancy

Abstract

In this paper, we initiate our discussion by demonstrating how Large Language Models (LLMs), when tasked with responding to queries, display a more even probability distribution in their answers if they are more adept, as opposed to their less skilled counterparts. Expanding on this foundational insight, we propose a new self-evaluation method ProbDiff for assessing the efficacy of various LLMs. This approach obviates the necessity for an additional evaluation model or the dependence on external, proprietary models like GPT-4 for judgment. It uniquely utilizes the LLMs being tested to compute the probability discrepancy between the initial response and its revised versions. A higher discrepancy for a given query between two LLMs indicates a relatively weaker capability. Our findings reveal that ProbDiff achieves results on par with those obtained from evaluations based on GPT-4, spanning a range of scenarios that include natural language generation (NLG) tasks such as translation, summarization, and our proposed Xiaohongshu blog writing task, and benchmarks for LLM evaluation like AlignBench, MT-Bench, and AlpacaEval, across LLMs of varying magnitudes.
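
The abstract describes the comparison only in words, so the following is a minimal sketch of the idea rather than the authors' implementation. It assumes each model under test can supply per-token log-probabilities for its own initial response and for its own revised response. The length normalization, the sign convention (revised minus initial), and the function names are all illustrative assumptions; the abstract does not specify them.

```python
from typing import Sequence


def mean_log_prob(token_log_probs: Sequence[float]) -> float:
    """Length-normalized log-probability of one response (assumed normalization)."""
    if not token_log_probs:
        raise ValueError("response has no tokens")
    return sum(token_log_probs) / len(token_log_probs)


def probability_discrepancy(initial: Sequence[float], revised: Sequence[float]) -> float:
    """Gap between the model's score for its revision and for its first answer.

    Assumed convention: revised minus initial, so a larger value means the
    model found more room to improve on its own first response.
    """
    return mean_log_prob(revised) - mean_log_prob(initial)


def weaker_model(discrepancy_a: float, discrepancy_b: float) -> str:
    """Per the abstract, the higher discrepancy on a query marks the weaker model."""
    if discrepancy_a == discrepancy_b:
        return "tie"
    return "A" if discrepancy_a > discrepancy_b else "B"


# Toy per-token log-probabilities (made-up numbers, not results from the paper).
model_a = probability_discrepancy(initial=[-0.5, -0.5], revised=[-0.25, -0.25])
model_b = probability_discrepancy(initial=[-1.0, -1.0], revised=[-0.25, -0.25])
verdict = weaker_model(model_a, model_b)
```

In practice the log-probabilities would come from scoring each response with the same model that generated it, and the per-query verdicts would be aggregated over a benchmark.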

Cite This Study

Xia et al. (2024) studied this question.

synapsesocial.com/papers/68e69d5db6db643587622c07 https://doi.org/10.48550/arxiv.2405.10516
