What question did this study set out to answer?

To evaluate the state of generative and agentic AI systems and identify gaps in existing evaluation practices.

December 19, 2025 | Open Access

Evaluation and Benchmarking of Generative and Agentic AI Systems: A Comprehensive Survey

Key Points

Conducted a comprehensive survey of evaluation dimensions for generative and agentic AI systems.
Reviewed existing benchmarks used for evaluating AI systems in real-world applications.
Proposed future directions for improving benchmarking practices and addressing gaps identified.
Identified that current evaluations often miss key dimensions like cost and multi-agent coordination.
Proposed a taxonomy of evaluation dimensions that increases the relevance of assessments in enterprise settings.
Found that many existing benchmarks do not align with practical deployment needs for AI systems.

Abstract

The rapid emergence of generative and agentic artificial intelligence (AI) has outpaced traditional evaluation practices. While large language models excel on static language benchmarks, real-world deployment demands more than accuracy on curated tasks. Agentic systems use planning, tool invocation, memory, and multi-agent collaboration to perform complex workflows. Enterprise adoption therefore hinges on holistic assessments that include cost, latency, reliability, safety, and multi-agent coordination. This survey provides a comprehensive taxonomy of evaluation dimensions, reviews existing benchmarks for generative and agentic systems, identifies gaps between laboratory tests and production requirements, and proposes future directions for more realistic, multi-dimensional benchmarking.
