What question did this study set out to answer?

The aim is to improve the quality and evaluation of synthetic knowledge-based dialogues by reducing hallucinations and manual costs.

February 2, 2026 | Open Access

Truth Is Better Generated than Annotated: Hierarchical Prompt Engineering and Adaptive Evaluation for Reliable Synthetic Knowledge Dialogues

Key Points

Developed ACE, a hybrid method built on hierarchical prompt engineering.
Introduced the Truthful Answer Score (TAS) for evaluating dialogue quality.
Compared performance against existing benchmarks like Wizard of Wikipedia and FaithDial.
ACE achieved higher quality dialogues than existing datasets.
TAS provided better alignment with human evaluations of dialogue quality.
Reduced dependence on human annotation while improving dataset reliability.

Abstract

Large Language Models (LLMs) have demonstrated exceptional performance in knowledge-based dialogue generation and text evaluation. Synthetic data serves as a cost-effective alternative for generating high-quality datasets. However, it is often plagued by hallucinations, inconsistencies, and self-anthropomorphized responses. Concurrently, manual construction of knowledge-based dialogue datasets remains bottlenecked by prohibitive costs and inherent human subjectivity. To address these multifaceted challenges, we propose ACE (Automatic Construction of Knowledge-Grounded and Engaging Human–AI Conversation Dataset), a hybrid method using hierarchical prompt engineering. This approach mitigates hallucinations and self-personalization while maintaining response consistency. Furthermore, existing human and automated evaluation methods struggle to assess critical factors like factual accuracy and coherence. To overcome this, we introduce the Truthful Answer Score (TAS), a novel metric specifically designed for knowledge-based dialogue evaluation. Our experimental results demonstrate that the ACE dataset achieves higher quality than existing benchmarks, such as Wizard of Wikipedia (WoW) and FaithDial. Additionally, TAS aligns more closely with human judgment, offering a more reliable and scalable evaluation framework. Our findings demonstrate that leveraging LLMs through systematic prompting can substantially reduce reliance on human annotation while simultaneously elevating the quality and reliability of synthetic datasets.
