What type of study is this?

This is a systematic review.

What question did this study set out to answer?

This survey asks how LLM-based conversational agents are evaluated in multi-turn settings: what should be assessed, and which methods are used to assess it.

February 16, 2026

Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey

Key Points

Surveys how LLM-based conversational agents are evaluated in multi-turn settings.
Systematically reviews nearly 250 scholarly sources using a PRISMA-inspired framework.
Develops two taxonomies: one for what to evaluate and one for how to evaluate.
Categorizes evaluation approaches into annotation-based evaluation, automated metrics, hybrid strategies, and LLM self-judging.
Identifies the key components to evaluate: task completion, response quality, user experience, memory and context retention, and planning and tool integration.
Highlights the importance of user experience and tool integration.
Proposes future directions: scalable, real-time evaluation pipelines, privacy-preserving mechanisms, and robust multi-turn metrics.

Abstract

This survey examines evaluation methods for large language model (LLM)-based agents in multi-turn conversational settings. Using a PRISMA-inspired framework, we systematically reviewed nearly 250 scholarly sources, capturing the state of the art from various venues of publication, and establishing a solid foundation for our analysis. Our study offers a structured approach by developing two interrelated taxonomy systems: one that defines what to evaluate and another that explains how to evaluate. The first taxonomy identifies key components of LLM-based agents for multi-turn conversations and their evaluation dimensions, including task completion, response quality, user experience, memory and context retention, as well as planning and tool integration. These components ensure that the performance of conversational agents is assessed in a holistic and meaningful manner. The second taxonomy system focuses on the evaluation methodologies. It categorizes approaches into annotation-based evaluations, automated metrics, hybrid strategies that combine human assessments with quantitative measures, and self-judging methods utilizing LLMs. This framework not only captures traditional metrics derived from language understanding, such as BLEU and ROUGE scores, but also incorporates advanced techniques that reflect the dynamic, interactive nature of multi-turn dialogues. Together, these frameworks summarize the current status quo, expose limitations in traditional practices, and provide a structured blueprint for improvement. Based on the summarization of existing studies, we identify several challenges and propose future directions, including the development of scalable, real-time evaluation pipelines, enhanced privacy-preserving mechanisms, and robust metrics that capture dynamic multi-turn interactions.
Our contributions bridge historical insights with modern practices, paving the way for next-generation, reliably evaluated conversational AI systems and offering a comprehensive guide for researchers and practitioners.
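
To make the "automated metrics" category concrete, here is a minimal sketch of a per-turn overlap score applied to a multi-turn dialogue. It is not taken from the survey: the function names (unigram_f1, score_dialogue), the whitespace tokenization, and the sample dialogue are all assumptions for illustration, and the score is a simplified ROUGE-1-style F1 rather than the official ROUGE implementation.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1 (hypothetical helper): clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def score_dialogue(turns):
    """Average the per-turn score over (agent_response, reference) pairs."""
    scores = [unigram_f1(response, reference) for response, reference in turns]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up two-turn dialogue: (agent response, reference response).
dialogue = [
    ("your flight is booked for monday", "your flight is booked for monday"),
    ("the hotel is near the airport", "the hotel is downtown"),
]

print(round(score_dialogue(dialogue), 2))  # 0.8
```

Each turn is scored in isolation here, so the result says nothing about whether the agent remembered earlier turns or completed the task. That gap is arguably one reason the abstract calls for metrics that capture dynamic multi-turn interactions.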

Cite This Study

Guan et al. studied this question.

synapsesocial.com/papers/69926503eb1f82dc367a0d09 https://doi.org/10.1145/3793671
