Existing Role-Playing Agents (RPAs), powered by large language models, are predominantly evaluated on static, text-only, dyadic conversations, which inadequately reflect the complexity of realistic human interactions involving multiple interlocutors and multi-modal communication. To bridge this gap, we propose OmniCharacter++, the first benchmark for evaluating multi-character interactions in a joint text-speech context. Specifically, OmniCharacter++ contributes: (1) a large-scale dataset comprising 10,287 characters, 118,017 multi-turn dialogues, and over one million audio responses across 8 open-world topics and 31 subfields, covering diverse multi-modal role-playing scenarios; (2) a comprehensive evaluation suite for dialogue understanding, generation quality, and perceptual naturalness; and (3) UniCharacter-7B, a unified text-speech model trained on this dataset to manage complex multi-character dynamics, ensuring both role-specific vocal fidelity and cross-participant semantic alignment. Experimental results demonstrate that UniCharacter-7B produces more realistic role-playing responses, with gains in both attractiveness and consistency, while also showing that OmniCharacter++ poses substantial challenges for state-of-the-art models, charting a clear path for future research. The code is publicly available at: https://github.com/zchoi/OmniCharacter-plus.
Zhang et al. studied this question.