Natural language processing (NLP) is developing rapidly. A series of Large Language Models (LLMs), represented by ChatGPT, have emerged and made significant breakthroughs in natural language understanding and generation: they can hold fluent dialogue with humans, understand human intentions, and complete complex tasks. However, in addition to the fairness and toxicity problems already known from traditional language models, LLMs have introduced new problems, including hallucination, that make them hard to use. Evaluating LLMs manually is challenging because it is subjective and inefficient. In this paper, we focus on three aspects of automatic LLM evaluation: fuzzy matching, toxicity detection, and hallucination detection. We fine-tune the Mixtral-8x7B model, which can be deployed in a private cloud environment, and demonstrate the effectiveness of our method through experiments.
Ding et al. (Fri,) studied this question.