What question did this study set out to answer?

This research aims to evaluate the compound efficiency of FlashAttention and the power metric in the inference compute stack.

April 29, 2026 | Open Access

The Two-Layer Efficiency Stack: FlashAttention (Operation Efficiency) and Power Metric (Allocation Efficiency) as Independent, Compounding Layers of the Inference Compute Stack

Key Points

Combined existing benchmarks of FlashAttention and power metric frameworks.
Calculated overall savings using the formula total_savings = 1 - (1 - FA_savings) × (1 - PM_savings).
Assessed efficiency at various sequence lengths (4,096 and 16,384 tokens).
At sequence length 4,096 tokens, combined inference savings reached 97.4% (39.1x multiplier).
At sequence length 16,384 tokens, combined inference savings reached 98.4% (62.3x multiplier).
Combined training savings reached 75.2% (4.0x multiplier), using the hypothesized 29% training reduction from Paper 1.
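
The figures in the key points can be reproduced with a few lines of Python. This is a minimal sketch, not the author's code: the function names and the scenario table are hypothetical, the 65% and 92.7% inputs come from the abstract, and two inputs are assumptions inferred because they reproduce the reported totals (a 78% FlashAttention saving at 16,384 tokens, the top of the stated 15-78% range, and a 65% FlashAttention saving for the training case).

```python
def combined_savings(fa_savings: float, pm_savings: float) -> float:
    """Apply total_savings = 1 - (1 - FA_savings) * (1 - PM_savings)."""
    for name, value in (("fa_savings", fa_savings), ("pm_savings", pm_savings)):
        if not 0.0 <= value < 1.0:
            raise ValueError(f"{name} must be in [0, 1), got {value}")
    return 1.0 - (1.0 - fa_savings) * (1.0 - pm_savings)


def multiplier(total_savings: float) -> float:
    """Convert a fractional saving into an efficiency multiplier (e.g. 39.1x)."""
    return 1.0 / (1.0 - total_savings)


# Hypothetical scenario table: (FlashAttention saving, power metric saving).
SCENARIOS = {
    # 65% and 92.7% are stated in the abstract.
    "inference @ 4,096 tokens": (0.65, 0.927),
    # ASSUMPTION: 78% is inferred from the 15-78% range, not stated per length.
    "inference @ 16,384 tokens": (0.78, 0.927),
    # ASSUMPTION: 65% FlashAttention saving is inferred for training; 29% is stated.
    "training": (0.65, 0.29),
}

if __name__ == "__main__":
    for label, (fa, pm) in SCENARIOS.items():
        total = combined_savings(fa, pm)
        print(f"{label}: {total:.2%} saved, {multiplier(total):.2f}x")
```

Because the formula multiplies the remaining fractions, the larger of the two savings dominates the total: the power metric's 92.7% leaves only 7.3% of the sampling compute, and FlashAttention then scales that remainder down further.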

Abstract

This paper is a combination calculator, not a new algorithm. It quantifies the compound efficiency of two fully independent methods that address different layers of the inference compute stack and can be deployed simultaneously without modification to either. FlashAttention (Dao et al., 2022, 2023) is an exact attention algorithm that reorganizes memory access to reduce bandwidth consumption per operation by 15-78%, depending on sequence length. It does not prune tokens, skip computation, or change what is computed; it changes only how the computation is carried out. The power metric framework (Cantrell 2026) is an allocation signal that operates at a completely different level: it identifies unproductive training runs or inference samples and stops them early, saving 21-43% of training compute (hypothesized, Paper 1) or 92.7% of sampling compute (simulation, Paper 2). These two methods address distinct bottlenecks (memory bandwidth vs. allocation waste) and compound multiplicatively. At sequence length 4,096 tokens, combined inference savings reach 97.4% (39.1x multiplier), combining FlashAttention (65%) with adaptive sampling reduction (92.7%, Paper 2 simulation). Combined training savings reach 75.2% (4.0x), using the hypothesized 29% training reduction from Paper 1 signal analysis. At sequence length 16,384 tokens, combined inference savings reach 98.4% (62.3x). These results are computed from published FlashAttention benchmarks (Dao et al., 2022, 2023) and power metric results from prior work (Cantrell 2026). No new experiments are required: the combination formula is total_savings = 1 - (1 - FA_savings) × (1 - PM_savings), which follows from the independence of the two optimizations. We propose that FlashAttention and the power metric form a natural two-layer efficiency stack and suggest the Intelligence Per Watt metric (Mirhoseini et al., 2025) as the unified measure of the combined improvement.
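
The step from independence to the combination formula can be written out as a short derivation. The notation here is introduced only for illustration and is not from the paper: $s_{\mathrm{FA}}$ and $s_{\mathrm{PM}}$ denote the two fractional savings, $C_0$ the baseline compute cost, and $C$ the cost with both layers applied. The sketch assumes each saving acts as a fractional reduction in cost and that neither changes the other.

```latex
\begin{align}
  % The power metric keeps a fraction (1 - s_PM) of the runs or samples;
  % FlashAttention leaves a fraction (1 - s_FA) of the cost of each one.
  \frac{C}{C_0} &= (1 - s_{\mathrm{FA}})\,(1 - s_{\mathrm{PM}}) \\
  \text{total\_savings} &= 1 - \frac{C}{C_0}
      = 1 - (1 - s_{\mathrm{FA}})\,(1 - s_{\mathrm{PM}}) \\
  \text{multiplier} &= \frac{C_0}{C}
      = \frac{1}{(1 - s_{\mathrm{FA}})\,(1 - s_{\mathrm{PM}})} \\
  % Check against the abstract's 4,096-token inference case.
  s_{\mathrm{FA}} = 0.65,\ s_{\mathrm{PM}} = 0.927:\quad
  \frac{C}{C_0} &= 0.35 \times 0.073 = 0.02555, \quad
  \text{total\_savings} \approx 97.4\%, \quad
  \text{multiplier} \approx 39.1
\end{align}
```

If the two optimizations interacted, for example if early stopping changed the sequence lengths that FlashAttention sees, the product form would no longer hold exactly; the formula depends on the independence the abstract asserts.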
Keywords: FlashAttention, memory bandwidth, IO-awareness, power metric, compute efficiency, two-layer stack, multiplicative savings, intelligence per watt, training efficiency
