What question did this study set out to answer?

This research aims to enhance AI efficiency by developing an algorithmic stack that improves intelligence per watt (IPW).

April 24, 2026 | Open Access

You Are Wasting 98% of Your Inference Compute: 54x Intelligence Per Watt from Software Alone — A Multiplicative Algorithmic Stack for LLM Inference

Key Points

Introduced a three-layer algorithmic stack (FlashAttention, run-level power metric allocation, and early exit) for improving LLM inference efficiency.
Built on the longitudinal study by Saad-Falcon et al. (2025), which documents a 5.3x IPW improvement from 2023 to 2025.
Identified three orthogonal mechanisms: per-operation memory bandwidth efficiency (2.86x), allocation efficiency (5.18x), and depth efficiency (1.56x).
Estimated a combined 54x IPW improvement at a sequence length of 4,096 tokens.
Estimated a combined IPW improvement of up to approximately 122x versus the 2023 baseline when applied on top of the 2025 hardware baseline.
Argued that these gains are available today on existing hardware and do not depend on future hardware generations.

Abstract

Saad-Falcon et al. (2025) introduced Intelligence Per Watt (IPW) as the critical metric for tracking AI efficiency: task accuracy divided by power consumed. Their longitudinal study documents a 5.3x IPW improvement from 2023 to 2025, driven by model and hardware advances. This paper demonstrates that a stack of algorithmic efficiency optimizations, derived from a unified stochastic health monitoring framework, provides an additional multiplicative IPW improvement on top of whatever hardware is available. The core three-layer algorithmic stack (FlashAttention, run-level power metric allocation, and early exit) provides a combined 54x IPW improvement at a sequence length of 4,096 tokens through three orthogonal mechanisms: per-operation memory bandwidth efficiency (FlashAttention, 2.86x), allocation efficiency reducing which operations occur (power metric inference, 5.18x), and depth efficiency reducing how many layers each operation uses (early exit, 1.56x). These layers are independent and compound multiplicatively. Applied on top of Saad-Falcon et al.'s 2025 hardware baseline, the combined IPW improvement is estimated at up to approximately 122x versus the 2023 baseline. The full stack, including speculative decoding and quality-preserving layers, reaches a 70x improvement from algorithms alone. Critically, these algorithmic gains are available today on existing hardware; they do not require waiting for the next hardware generation.

These estimates will strike some readers as impossible. They are not. They represent a structured upper bound under partial independence assumptions; that is, the math works out this way even if we wish it were less dramatic. The headline number is large. We checked. It's still large. The purpose of this paper is not to assert realized gains, but to map a hypothesis space and identify where empirical validation is most valuable.

Keywords: intelligence per watt, IPW, energy efficiency, algorithmic efficiency, FlashAttention, power metric, early exit, speculative decoding, compute stack, Saad-Falcon
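The multiplicative compounding described in the abstract can be checked with a short sketch. It takes only the figures quoted above (2.86x, 5.18x, 1.56x, the 5.3x hardware factor, and the 54x headline) as inputs. The function names `ipw`, `compound`, and `wasted_fraction` are hypothetical helpers, not part of the paper. Reading the title's "98%" as 1 - 1/54 is an assumption. Under full independence, the plain product of the three quoted factors comes to about 23.1x, and multiplying by the 5.3x hardware factor gives about 122x, in line with the estimate versus the 2023 baseline. The abstract does not spell out how the 54x figure relates to the per-layer factors, so the sketch treats 54x as a stated input rather than deriving it.

```python
import math

# Per-layer IPW factors as quoted in the abstract (sequence length 4,096 tokens).
LAYER_FACTORS = {
    "flash_attention": 2.86,          # per-operation memory bandwidth efficiency
    "power_metric_allocation": 5.18,  # allocation efficiency
    "early_exit": 1.56,               # depth efficiency
}
HARDWARE_FACTOR_2023_TO_2025 = 5.3    # reported by Saad-Falcon et al. (2025)
HEADLINE_FACTOR = 54.0                # headline figure, taken as given


def ipw(task_accuracy: float, power_watts: float) -> float:
    """Intelligence per watt: task accuracy divided by power consumed."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return task_accuracy / power_watts


def compound(factors) -> float:
    """Multiply improvement factors, assuming they are fully independent."""
    return math.prod(factors)


def wasted_fraction(improvement: float) -> float:
    """Assumed reading of the title: share of baseline compute made
    unnecessary by an `improvement`-fold efficiency gain."""
    return 1.0 - 1.0 / improvement


algorithmic = compound(LAYER_FACTORS.values())
versus_2023 = algorithmic * HARDWARE_FACTOR_2023_TO_2025

print(f"product of the three quoted factors: {algorithmic:.1f}x")
print(f"with the 5.3x hardware factor:       {versus_2023:.1f}x")
print(f"share implied by the 54x headline:   {wasted_fraction(HEADLINE_FACTOR):.1%}")
```

Because the paper frames its numbers as an upper bound under partial independence assumptions, any measured combination of these layers could come in below the plain product.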
