Large Language Models (LLMs) have demonstrated exceptional performance in knowledge-based dialogue generation and text evaluation. Synthetic data serves as a cost-effective alternative for generating high-quality datasets. However, it is often plagued by hallucinations, inconsistencies, and self-anthropomorphized responses. Meanwhile, manual construction of knowledge-based dialogue datasets remains bottlenecked by prohibitive costs and inherent human subjectivity. To address these challenges, we propose ACE (Automatic Construction of Knowledge-Grounded and Engaging Human–AI Conversation Dataset), a hybrid method using hierarchical prompt engineering. This approach mitigates hallucinations and self-personalization while maintaining response consistency. Furthermore, existing human and automated evaluation methods struggle to assess critical factors such as factual accuracy and coherence. To overcome this, we introduce the Truthful Answer Score (TAS), a novel metric specifically designed for knowledge-based dialogue evaluation. Our experimental results demonstrate that the ACE dataset achieves higher quality than existing benchmarks, such as Wizard of Wikipedia (WoW) and FaithDial. Additionally, TAS aligns more closely with human judgment than existing evaluation methods, offering a more reliable and scalable evaluation framework. These findings indicate that leveraging LLMs through systematic prompting can substantially reduce reliance on human annotation while improving the quality and reliability of synthetic datasets.
Hyeongju Ju
EunKyeong Lee
Junyoung Kang
Applied Sciences
Kyungpook National University
Korea Telecom (South Korea)
Ju et al. (Thu,) studied this question.
www.synapsesocial.com/papers/6980fcfcc1c9540dea80eb62 — DOI: https://doi.org/10.3390/app16031387