This work introduces the Energy Decision Benchmark (EDB), a reproducible evaluation framework designed to assess whether AI systems are suitable for deployment in regulated, safety-critical energy infrastructures. Unlike existing benchmarks that evaluate linguistic quality or analytical plausibility, EDB evaluates governed decision-making under explicit physical, regulatory, and economic constraints. EDB formalizes the distinction between functional capability and operational readiness. It evaluates systems across nine capability blocks: five functional capabilities (C1–C5)—consistency and reproducibility, constraint validation, portfolio evaluation, governed decision, and counterfactual robustness—and four operational capabilities (C6–C9)—multi-turn coherence, operational sufficiency, rollback, and context isolation. Functional capabilities are scored quantitatively, while operational capabilities act as necessary gates for deployment. The benchmark is instantiated on residential and small-business energy scenarios under Spanish regulation, including photovoltaic generation, battery storage, electric-vehicle integration, tariff selection, and contracted power optimization. EDB defines structured input and output schemas, explicit constraint rules, failure conditions, and an Operational Readiness Gate (ORG) that enforces absolute determinism, efficiency, and auditability as non-negotiable deployment requirements. As a baseline, the paper evaluates a deterministic vertical AGI architecture and compares it against state-of-the-art general-purpose large language models under identical inputs and protocols. Results show that the deterministic baseline achieves Operational Readiness Level 5, completing all nine capability blocks with no observed failures. 
In contrast, general-purpose LLMs fail the operational gate because of structural non-determinism, a lack of action-threshold calibration, and state contamination under context switching, despite high performance on isolated functional tasks. EDB reframes benchmarking in critical domains, shifting the question from how well models answer questions to whether systems are admissible for real-world deployment. The results demonstrate that operational readiness in regulated infrastructures is an architectural property, not a prompt-level or model-size effect, and that deterministic, rule-governed systems are required to cross this boundary.
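The split between scored functional capabilities and gating operational capabilities can be illustrated with a minimal sketch. It is not the benchmark's actual implementation: the names `BlockResults`, `mean_functional_score`, `passes_operational_gate`, and `is_reproducible` are hypothetical, and the 0.0 to 1.0 score scale, the unweighted mean, and the hash-based comparison of repeated runs are all assumptions made for illustration.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Capability block identifiers as named in the abstract.
FUNCTIONAL_BLOCKS = ("C1", "C2", "C3", "C4", "C5")
OPERATIONAL_BLOCKS = ("C6", "C7", "C8", "C9")


@dataclass(frozen=True)
class BlockResults:
    """Hypothetical container for one system's results."""
    functional_scores: Dict[str, float]   # assumed scale: 0.0 to 1.0
    operational_passed: Dict[str, bool]   # pass/fail per operational block


def mean_functional_score(results: BlockResults) -> float:
    """Unweighted mean over C1..C5 (the aggregation rule is an assumption)."""
    total = sum(results.functional_scores[b] for b in FUNCTIONAL_BLOCKS)
    return total / len(FUNCTIONAL_BLOCKS)


def passes_operational_gate(results: BlockResults) -> bool:
    """Operational blocks act as necessary conditions: one failure blocks
    deployment, regardless of how high the functional score is."""
    return all(results.operational_passed[b] for b in OPERATIONAL_BLOCKS)


def is_reproducible(
    system: Callable[[Dict[str, Any]], Dict[str, Any]],
    scenario: Dict[str, Any],
    runs: int = 5,
) -> bool:
    """Call the system repeatedly on the same input and require
    byte-identical canonical JSON output on every run."""
    digests = set()
    for _ in range(runs):
        canonical = json.dumps(system(scenario), sort_keys=True)
        digests.add(hashlib.sha256(canonical.encode("utf-8")).hexdigest())
    return len(digests) == 1


if __name__ == "__main__":
    example = BlockResults(
        functional_scores={"C1": 1.0, "C2": 0.5, "C3": 1.0, "C4": 0.5, "C5": 1.0},
        operational_passed={"C6": True, "C7": True, "C8": True, "C9": False},
    )
    print("functional score:", mean_functional_score(example))
    print("admissible:", passes_operational_gate(example))
```

In this sketch a system with a high functional score is still rejected when any single operational block fails, which mirrors the abstract's claim that strong performance on isolated tasks does not imply admissibility.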
A. A. Diaz-Gonzalez studied this question.