Recently, the advent of large language models (LLMs) has revolutionized generative agents. Among them, Role-Playing Conversational Agents (RPCAs) attract considerable attention due to their ability to emotionally engage users. However, the absence of a comprehensive benchmark impedes progress in this field. To bridge this gap, we introduce CharacterEval, a Chinese benchmark for comprehensive RPCA assessment, complemented by a tailored high-quality dataset. The dataset comprises 1,785 multi-turn role-playing dialogues, encompassing 11,376 examples and featuring 77 characters derived from Chinese novels and scripts. It was carefully constructed, beginning with initial dialogue extraction via GPT-4, followed by rigorous human-led quality control, and enhanced with in-depth character profiles sourced from Baidu Baike. CharacterEval employs a multifaceted evaluation approach, encompassing thirteen targeted metrics on four dimensions. To facilitate convenient evaluation of these subjective metrics in CharacterEval, we further developed CharacterRM, a role-playing reward model based on human annotations, which has a higher correlation with human judgment than GPT-4. Comprehensive experiments on CharacterEval demonstrate that Chinese LLMs exhibit more promising capabilities than GPT-4 in Chinese role-playing conversation. Source code, data source, and reward model will be publicly accessible at https://github.com/morecry/CharacterEval.
Tu et al. studied this question.