Recently, there has been a growing trend of using Large Language Models (LLMs) to evaluate the quality of other LLMs. Many studies have employed proprietary closed-source models, especially GPT4, as the evaluator. Other works have instead fine-tuned open-source LLMs to serve as judge models. In this work, we conduct an empirical study of the evaluation capability of different judge models. Our findings indicate that although the fine-tuned judge models achieve high accuracy on in-domain test sets, even surpassing GPT4, they are inherently task-specific classifiers, and their generalizability and fairness fall far short of GPT4's.
Huang et al. (Tue,) studied this question.
www.synapsesocial.com/papers/68e758cbb6db6435876d0938 — DOI: https://doi.org/10.48550/arxiv.2403.02839
Hui Huang
Yingqi Qu
Jing Liu