What question did this study set out to answer?

This research aims to develop an evaluation framework for agentic AI systems that ensures safety and governance in their deployment.

January 20, 2026 · Open Access

SAGE-AI: Structured, Auditable, and Statistically Rigorous Evaluation of Agentic AI Systems

Key Points

Introduces the SAGE-AI architecture, which separates agent behavior into distinct planner, executor, critic, and synthesizer roles.
Adds a governance-aware orchestration layer for policy enforcement, controlled tool usage, and end-to-end traceability.
Compares SAGE-AI with a monolithic agent baseline in a controlled experiment across planning, tool-use, and verification tasks.
SAGE-AI reduces unsafe actions relative to the monolithic baseline.
SAGE-AI also improves traceability and task reliability.
Governance and multi-role coordination add an expected latency overhead.
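
The role decomposition and governance layer listed above can be illustrated with a minimal Python sketch. This is not the paper's implementation: every name here (`Orchestrator`, `TraceEvent`, `planner`, `critic`, `synthesizer`, `run`, and the toy tools) is hypothetical, and the planner is a hard-coded stand-in for a language model. It only shows the general idea of routing every tool call through one layer that enforces an allow-list and records a trace.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass
class TraceEvent:
    role: str      # which role acted
    action: str    # what it did
    allowed: bool  # whether the governance layer permitted it


@dataclass
class Orchestrator:
    """Hypothetical governance layer: gates tool calls and records a trace."""
    tools: Dict[str, Callable[[str], str]]
    allowed_tools: Set[str]
    trace: List[TraceEvent] = field(default_factory=list)

    def record(self, role: str, action: str, allowed: bool = True) -> None:
        self.trace.append(TraceEvent(role, action, allowed))

    def call_tool(self, name: str, arg: str) -> Optional[str]:
        # Policy check (assumed to be a simple allow-list) happens before execution.
        allowed = name in self.allowed_tools and name in self.tools
        self.record("executor", f"{name}({arg!r})", allowed)
        return self.tools[name](arg) if allowed else None


def planner(task: str) -> List[Tuple[str, str]]:
    # Toy plan; a real planner would query a language model.
    return [("search", task), ("delete_files", "/tmp")]


def critic(results: List[Optional[str]]) -> List[str]:
    # Keep only the steps that actually produced a result.
    return [r for r in results if r is not None]


def synthesizer(accepted: List[str]) -> str:
    return "; ".join(accepted)


def run(task: str, orch: Orchestrator) -> str:
    steps = planner(task)
    orch.record("planner", f"{len(steps)} steps")
    results = [orch.call_tool(name, arg) for name, arg in steps]
    accepted = critic(results)
    orch.record("critic", f"accepted {len(accepted)} of {len(results)}")
    answer = synthesizer(accepted)
    orch.record("synthesizer", "final answer")
    return answer


# Toy usage: "delete_files" exists as a tool but is not on the allow-list.
orch = Orchestrator(
    tools={
        "search": lambda q: f"results for {q}",
        "delete_files": lambda p: f"deleted {p}",
    },
    allowed_tools={"search"},
)
answer = run("quarterly report", orch)
blocked = [e for e in orch.trace if not e.allowed]
```

In this sketch the blocked call still appears in `orch.trace`, so an auditor can see both what the agent attempted and what the policy refused. The extra bookkeeping per step is also where a latency overhead of the kind mentioned above would plausibly come from.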

Abstract

Recent advances in large language models (LLMs) have enabled autonomous agentic systems capable of multi-step planning, tool use, and iterative reasoning. Despite their promise, many existing agent-based implementations remain ad hoc, tightly coupled, and insufficiently governed, limiting their suitability for enterprise and safety-critical environments. In particular, the lack of deterministic control, systematic evaluation, and auditability poses significant challenges for real-world deployment. This paper introduces SAGE-AI (Structured Agentic Governance for Enterprise AI), a composable architecture and evaluation framework for agentic AI systems. SAGE-AI decomposes autonomous behavior into explicit planner, executor, critic, and synthesizer roles, coordinated through a governance-aware orchestration layer. This design enables policy enforcement, controlled tool usage, and end-to-end traceability without modifying the underlying language models. We present a controlled experimental evaluation comparing SAGE-AI against a monolithic agent baseline across planning, tool-use, and verification tasks. The evaluation emphasizes architectural behavior rather than raw model performance, measuring task success, invalid actions, trace completeness, and recovery behavior under identical conditions. Results show that SAGE-AI reduces unsafe actions, improves traceability, and improves task reliability relative to monolithic agents, with an expected latency overhead from governance and multi-role coordination. By combining architectural decomposition with a statistically rigorous, auditable evaluation methodology, this work bridges the gap between experimental agentic systems and production-ready autonomous AI.
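
The abstract names the measured quantities (task success, invalid actions, trace completeness) but this summary does not say which statistical procedure the paper uses. The sketch below is therefore only one plausible way to compare two architectures on those metrics: it computes each metric per condition and a percentile bootstrap confidence interval for the difference. The `Episode` record, the metric functions, and `bootstrap_diff_ci` are all hypothetical, and the episode values are synthetic placeholders, not results from the paper.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Episode:
    success: bool
    invalid_actions: int
    trace_logged: int    # trace events actually recorded
    trace_expected: int  # trace events a complete log would contain


def success_rate(eps: List[Episode]) -> float:
    return sum(e.success for e in eps) / len(eps)


def invalid_per_episode(eps: List[Episode]) -> float:
    return sum(e.invalid_actions for e in eps) / len(eps)


def trace_completeness(eps: List[Episode]) -> float:
    return sum(e.trace_logged for e in eps) / sum(e.trace_expected for e in eps)


def bootstrap_diff_ci(
    a: List[Episode],
    b: List[Episode],
    metric: Callable[[List[Episode]], float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for metric(a) - metric(b), unpaired resampling."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        ra = [rng.choice(a) for _ in a]
        rb = [rng.choice(b) for _ in b]
        diffs.append(metric(ra) - metric(rb))
    diffs.sort()
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = n_resamples - 1 - lo_idx
    return diffs[lo_idx], diffs[hi_idx]


# Synthetic placeholder episodes, for illustration only.
structured = [
    Episode(True, 0, 5, 5), Episode(True, 0, 5, 5),
    Episode(True, 1, 4, 5), Episode(False, 0, 5, 5),
]
monolithic = [
    Episode(True, 1, 3, 5), Episode(False, 2, 2, 5),
    Episode(True, 0, 4, 5), Episode(False, 3, 1, 5),
]

lo, hi = bootstrap_diff_ci(structured, monolithic, invalid_per_episode)
print(f"invalid actions per episode, difference 95% CI: [{lo:.2f}, {hi:.2f}]")
```

If both systems are run on the same task instances, as "identical conditions" suggests, a paired resampling scheme would usually be tighter than the unpaired one shown here; that choice is an assumption left open by the summary.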

Cite This Study

John Benito Jesudasan Peter studied this question.

synapsesocial.com/papers/696f1a629e64f732b51eea15 https://doi.org/10.5281/zenodo.18286008
