What question did this study set out to answer?

To examine algorithmic optimizations that can improve AI efficiency, specifically Intelligence Per Watt (IPW).

April 23, 2026 (Open Access)

You Are Wasting 96% of Your Inference Compute: 23x Intelligence Per Watt from Software Alone — A Multiplicative Algorithmic Stack for LLM Inference

Key Points

Examined algorithmic optimizations that improve AI efficiency, measured as Intelligence Per Watt (IPW).
Introduced a three-layer algorithmic stack: FlashAttention, run-level power metric allocation, and early exit.
Built on the 5.3x IPW improvement from 2023 to 2025 that Saad-Falcon et al. (2025) attribute to model and hardware advances.
Presented gains that are available on existing hardware, without waiting for the next hardware generation.
Achieved a combined 23x IPW improvement at sequence length 4,096 tokens.
Identified independent gains: FlashAttention (2.86x), power metric allocation (5.18x), and early exit (1.56x).
Estimated overall IPW improvement of up to 122x from the 2023 baseline on 2025 hardware.

Abstract

Saad-Falcon et al. (2025) introduced Intelligence Per Watt (IPW) as the critical metric for tracking AI efficiency: task accuracy divided by power consumed. Their longitudinal study documents a 5.3x IPW improvement from 2023 to 2025, driven by model and hardware advances. This paper demonstrates that a stack of algorithmic efficiency optimizations, derived from a unified stochastic health monitoring framework, provides an additional multiplicative IPW improvement on top of whatever hardware is available. The core three-layer algorithmic stack (FlashAttention, run-level power metric allocation, and early exit) provides a combined 23x IPW improvement at a sequence length of 4,096 tokens through three orthogonal mechanisms: per-operation memory bandwidth efficiency (FlashAttention, 2.86x), allocation efficiency that reduces which operations occur (power metric inference, 5.18x), and depth efficiency that reduces how many layers each operation uses (early exit, 1.56x). These layers are independent and compound multiplicatively (2.86 × 5.18 × 1.56 ≈ 23). Applied on top of Saad-Falcon et al.'s 2025 hardware baseline, the combined IPW improvement is estimated at up to approximately 122x (23 × 5.3) versus the 2023 baseline. The full stack, including speculative decoding and quality-preserving layers, reaches a 70x improvement from algorithms alone. Critically, these algorithmic gains are available today on existing hardware; they do not require waiting for the next hardware generation.
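The compounding arithmetic in the abstract can be checked with a minimal Python sketch. It is not code from the paper: the names `LAYER_GAINS`, `HARDWARE_GAIN_2023_TO_2025`, `ipw`, and `combined_gain` are hypothetical, and the only inputs are the factors quoted above. It assumes, as the abstract states, that the layers are independent, so their gains multiply.

```python
from math import prod

# Per-layer IPW gains quoted in the abstract (sequence length 4,096 tokens).
# The dictionary keys are illustrative labels, not identifiers from the paper.
LAYER_GAINS = {
    "flash_attention": 2.86,
    "power_metric_allocation": 5.18,
    "early_exit": 1.56,
}

# 2023-to-2025 IPW improvement attributed to Saad-Falcon et al. (2025).
HARDWARE_GAIN_2023_TO_2025 = 5.3


def ipw(accuracy: float, watts: float) -> float:
    """IPW as defined in the abstract: task accuracy divided by power consumed."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return accuracy / watts


def combined_gain(gains) -> float:
    """Multiply gains together, assuming the mechanisms are independent."""
    return prod(gains)


# Three-layer algorithmic stack, then the same stack on the 2025 baseline.
algorithmic = combined_gain(LAYER_GAINS.values())
total = algorithmic * HARDWARE_GAIN_2023_TO_2025

print(f"algorithmic stack: {algorithmic:.2f}x")
print(f"versus 2023 baseline: {total:.1f}x")
```

The product of the three quoted factors is about 23.1, and multiplying by 5.3 gives about 122.5, consistent with the rounded 23x and "approximately 122x" figures. If the layers were not independent, a simple product would overstate the combined gain, so the multiplication is only as good as that assumption.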

Cite This Study

Cole Cantrell studied this question.

synapsesocial.com/papers/69e9b91385696592c86ebfc0 https://doi.org/10.5281/zenodo.19685841
