The rapid emergence of generative and agentic artificial intelligence (AI) has outpaced traditional evaluation practices. While large language models excel on static language benchmarks, real-world deployment demands more than accuracy on curated tasks. Agentic systems use planning, tool invocation, memory, and multi-agent collaboration to perform complex workflows. Enterprise adoption therefore hinges on holistic assessments that include cost, latency, reliability, safety, and multi-agent coordination. This survey provides a comprehensive taxonomy of evaluation dimensions, reviews existing benchmarks for generative and agentic systems, identifies gaps between laboratory tests and production requirements, and proposes future directions for more realistic, multi-dimensional benchmarking.
Manmohan Shukla
Manmohan Shukla (Wed,) studied this question.
www.synapsesocial.com/papers/69449a922f0218eca9508656 — DOI: https://doi.org/10.20944/preprints202512.1421.v1