What question did this study set out to answer?

The study assesses whether AI systems are operationally ready for deployment in safety-critical energy infrastructures, using the Energy Decision Benchmark.

January 22, 2026 · Open Access

From Vertical AGI to Operational Readiness: Evaluating Governed Decision-Making with the Energy Decision Benchmark

Key Points

The study develops the Energy Decision Benchmark (EDB) to evaluate governed decision-making under explicit physical, regulatory, and economic constraints.
Functional capabilities are scored quantitatively; operational capabilities act as necessary gates for deployment.
The benchmark is instantiated on residential and small-business energy scenarios under Spanish regulation, including photovoltaic generation and battery storage.
A deterministic vertical AGI architecture is compared with general-purpose large language models under identical inputs and protocols.
The deterministic baseline achieves Operational Readiness Level 5 with no observed failures.
General-purpose LLMs fail the operational gate because of structural non-determinism, lack of action-threshold calibration, and state contamination under context switching.
Operational readiness is an architectural property, not an effect of model size or prompt-level adjustments.
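
The split between scored functional capabilities and gating operational capabilities can be sketched in a few lines. This is an illustrative sketch only, not the benchmark's implementation: the function name `evaluate`, the 0-to-1 score scale, and the plain average are assumptions; only the capability block labels C1 to C9 come from the abstract.

```python
from dataclasses import dataclass

# Block labels as named in the abstract.
FUNCTIONAL = ("C1", "C2", "C3", "C4", "C5")
OPERATIONAL = ("C6", "C7", "C8", "C9")


@dataclass(frozen=True)
class Verdict:
    functional_score: float  # assumed scale: 0.0 to 1.0
    gate_passed: bool        # True only if every operational block passes
    failed_gates: tuple      # operational blocks that did not pass


def evaluate(functional_scores, operational_results):
    """Hypothetical scoring: average the functional blocks, gate on the rest."""
    missing = [c for c in FUNCTIONAL if c not in functional_scores]
    if missing:
        raise ValueError(f"missing functional scores: {missing}")
    # Assumption: unweighted mean of the five functional scores.
    score = sum(functional_scores[c] for c in FUNCTIONAL) / len(FUNCTIONAL)
    # A missing or False operational result counts as a gate failure.
    failed = tuple(c for c in OPERATIONAL if not operational_results.get(c, False))
    return Verdict(score, not failed, failed)
```

Under this sketch, a system with high functional scores is still rejected if a single operational block fails, which mirrors the gate behaviour described in the key points.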

Abstract

This work introduces the Energy Decision Benchmark (EDB), a reproducible evaluation framework designed to assess whether AI systems are suitable for deployment in regulated, safety-critical energy infrastructures. Unlike existing benchmarks that evaluate linguistic quality or analytical plausibility, EDB evaluates governed decision-making under explicit physical, regulatory, and economic constraints. EDB formalizes the distinction between functional capability and operational readiness. It evaluates systems across nine capability blocks: five functional capabilities (C1–C5)—consistency and reproducibility, constraint validation, portfolio evaluation, governed decision, and counterfactual robustness—and four operational capabilities (C6–C9)—multi-turn coherence, operational sufficiency, rollback, and context isolation. Functional capabilities are scored quantitatively, while operational capabilities act as necessary gates for deployment. The benchmark is instantiated on residential and small-business energy scenarios under Spanish regulation, including photovoltaic generation, battery storage, electric-vehicle integration, tariff selection, and contracted power optimization. EDB defines structured input and output schemas, explicit constraint rules, failure conditions, and an Operational Readiness Gate (ORG) that enforces absolute determinism, efficiency, and auditability as non-negotiable deployment requirements. As a baseline, the paper evaluates a deterministic vertical AGI architecture and compares it against state-of-the-art general-purpose large language models under identical inputs and protocols. Results show that the deterministic baseline achieves Operational Readiness Level 5, completing all nine capability blocks with no observed failures. 
In contrast, general-purpose LLMs fail the operational gate due to structural non-determinism, lack of action-threshold calibration, and state contamination under context switching, despite high performance on isolated functional tasks. EDB reframes benchmarking in critical domains: from measuring how well models answer questions to determining whether systems are admissible for real-world deployment. The results demonstrate that operational readiness in regulated infrastructures is an architectural property, not a prompt-level or model-size effect, and that deterministic, rule-governed systems are required to cross this boundary.
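
The determinism requirement enforced by the gate can be illustrated with a repeated-run check: the same scenario is submitted several times and the outputs must be identical. The sketch below is a hedged illustration, not the EDB protocol; `canonical_digest`, `is_deterministic`, the run count, and the use of JSON-serializable outputs are all assumptions made for the example.

```python
import hashlib
import json


def canonical_digest(output):
    """Hash a JSON-serializable output in a key-order-independent form."""
    payload = json.dumps(output, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_deterministic(system, scenario, runs=5):
    """Return True if `system` yields the same output on every repeated run.

    `system` is any callable taking a scenario and returning a
    JSON-serializable decision (a hypothetical interface).
    """
    digests = {canonical_digest(system(scenario)) for _ in range(runs)}
    return len(digests) == 1
```

A rule-governed system that maps a scenario to a fixed decision passes this check trivially, whereas a system whose output varies between runs, even slightly, produces more than one digest and fails it.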

Cite This Study

A. A. Diaz-Gonzalez studied this question.

synapsesocial.com/papers/6971be50642b1836717e2ea6 https://doi.org/10.5281/zenodo.18269146