As AI inference scales to billions of queries, per-query energy estimates are increasingly important for capacity planning, efficiency interventions, and policy. Yet many public estimates assume non-production settings, which leads to systematic overestimation. We introduce a bottom-up framework that estimates inference energy from token throughput, node power, and overhead under large-scale deployment assumptions. For frontier-scale models (>200B parameters) on H100 nodes, we estimate a median energy of 0.31 Wh/query (IQR 0.16–0.60), indicating that widely cited estimates are overstated by 4–20×. In test-time scaling scenarios, with queries 15× longer than typical, the median energy rises 13× to 3.91 Wh (IQR 2.15–7.05). Across models, serving systems, and hardware, we estimate 8–20× line-of-sight energy reductions. At datacenter scale, serving 1 billion queries/day requires 0.7 GWh/day; if 10% are long queries, demand rises to 1.7 GWh/day. With efficiency interventions, it falls to 0.8 GWh/day, mitigating the energy impact of test-time scaling. This version also includes an estimate of water use per query at hyperscalers.

This repository is provided for research and informational purposes only and does not constitute legal, regulatory, compliance, or policy guidance. Results should be interpreted as assumption-dependent and directional, not as definitive measurements of AI energy or water use across all systems or as guaranteed efficiency outcomes. The analysis focuses on per-query inference energy and efficiency pathways only and is not a full environmental or lifecycle assessment.
Oviedo et al. studied this question.
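
The abstract describes a bottom-up estimate built from token throughput, node power, and overhead. The Python sketch below shows one simple way such a calculation could be arranged. It is not the authors' model. The function names, the use of a PUE-style multiplier for overhead, and the input values in the per-query example are all hypothetical and chosen only to make the arithmetic visible. The fleet-level example uses a mean of 0.7 Wh/query, which is the value implied by the abstract's figure of 0.7 GWh/day for 1 billion queries/day.

```python
def energy_per_query_wh(node_power_w, overhead_multiplier,
                        tokens_per_query, node_throughput_tok_s):
    """Estimate energy per query in watt-hours.

    Assumed form (illustrative, not taken from the source):
    facility power for one node, times the node time one query
    occupies, converted from joules to watt-hours.
    """
    if node_throughput_tok_s <= 0:
        raise ValueError("node_throughput_tok_s must be positive")
    # Share of node time attributable to one query, in seconds.
    seconds_per_query = tokens_per_query / node_throughput_tok_s
    # Node power scaled by a PUE-style overhead factor (an assumption).
    facility_power_w = node_power_w * overhead_multiplier
    joules = facility_power_w * seconds_per_query
    return joules / 3600.0  # 1 Wh = 3600 J


def fleet_energy_gwh_per_day(queries_per_day, mean_wh_per_query):
    """Scale a mean per-query energy to daily fleet demand in GWh."""
    return queries_per_day * mean_wh_per_query / 1e9  # 1 GWh = 1e9 Wh


# Hypothetical inputs, not the paper's parameters.
example_wh = energy_per_query_wh(
    node_power_w=8000.0,           # assumed node draw under load
    overhead_multiplier=1.2,       # assumed datacenter overhead
    tokens_per_query=500,          # assumed tokens generated per query
    node_throughput_tok_s=4000.0,  # assumed aggregate node throughput
)

# Mean of 0.7 Wh/query is implied by the abstract's 0.7 GWh/day
# for 1 billion queries/day.
example_gwh = fleet_energy_gwh_per_day(1e9, 0.7)

print(f"{example_wh:.3f} Wh/query, {example_gwh:.2f} GWh/day")
```

Because per-query energy distributions of this kind are typically right-skewed, the fleet total should be scaled from a mean rather than from the reported median of 0.31 Wh/query, which is why the two figures differ.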