What question did this study set out to answer?

The study aims to make attention efficient at large head dimensions (D > 256) using FFPA (Faster Flash Prefill Attention).

June 13, 2026 | Open Access

FFPA: Efficient Flash Prefill Attention for Large Head Dimensions via Split-D

Key Points

Targets attention with large head dimensions (D > 256), where standard FlashAttention kernels exhaust shared memory and registers.
Introduces Split-D, a head-dimension chunking strategy that decomposes QK^T and PV into sub-operations over D-slices.
Provides two hardware-specific backward-pass strategies: an fp32 gradient buffer for SM ≤ 89 GPUs and a Hopper-specialized CuTe DSL kernel for SM ≥ 90 GPUs.
Evaluated on H200, H800, H20, RTX 5090, and L20 GPUs.
Achieves a 1.5–6.1× speedup over PyTorch SDPA for head dimensions from 320 to 1024.
Reaches up to 426 TFLOPS on H200.
Raises end-to-end training throughput by 1.4–1.5× on Gemma4-31B.

Abstract

Attention mechanisms with large head dimensions (D > 256) are increasingly common in large multimodal models (e.g., Gemma4-31B), yet standard FlashAttention kernels, optimized for D ≤ 256, suffer from shared memory exhaustion and register pressure when D grows beyond 256. We present FFPA (Faster Flash Prefill Attention), which introduces Split-D, a head-dimension chunking strategy that decomposes QK^T and PV into sub-operations over D-slices. Split-D reduces SRAM complexity from O(Bc × D) to O(Bc × D_chunk) ≈ O(1), keeping active register pressure bounded at O(D_chunk). The forward pass uses Split-D universally with no precision issues, making it suitable for both training and inference on any GPU. For the backward pass, two hardware-driven strategies are provided: an fp32 gradient buffer for compute-limited SM ≤ 89 GPUs, and a Hopper-specialized CuTe DSL kernel with full-D WGMMA forward and recompute-based backward for SM ≥ 90 GPUs. Across H200, H800, H20, RTX 5090, and L20, FFPA achieves a 1.5–6.1× speedup over PyTorch SDPA for D ∈ [320, 1024], reaching 426 TFLOPS on H200, and 1.4–1.5× end-to-end training throughput on Gemma4-31B. Code is available at https://github.com/xlite-dev/ffpa-attn.
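
The chunking idea in the abstract rests on two identities: QK^T is the sum over D-slices of Q_c K_c^T, and each D-slice of the output PV depends only on the matching slice of V. The sketch below checks this numerically in plain Python. It is an illustration under stated assumptions, not the FFPA kernel: the function names (attention_reference, attention_split_d) and the d_chunk parameter are hypothetical, the full score matrix is materialized, and tiling over Bc, online softmax, and low-precision arithmetic are all omitted.

```python
import math
import random


def matmul(a, b):
    """Row-by-column product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def col_slice(m, lo, hi):
    """Columns lo..hi-1 of m, i.e. one D-slice."""
    return [row[lo:hi] for row in m]


def softmax_rows(s):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in s:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        z = sum(e)
        out.append([v / z for v in e])
    return out


def attention_reference(q, k, v):
    """softmax(Q K^T / sqrt(D)) V computed over the full head dimension."""
    scale = 1.0 / math.sqrt(len(q[0]))
    s = matmul(q, transpose(k))
    return matmul(softmax_rows([[x * scale for x in row] for row in s]), v)


def attention_split_d(q, k, v, d_chunk):
    """Same result, but Q, K and V are only ever touched one D-slice at a time.

    Hypothetical illustration of head-dimension chunking; not the paper's kernel.
    """
    d = len(q[0])
    n_q, n_k = len(q), len(k)
    scale = 1.0 / math.sqrt(d)

    # Stage 1: accumulate S = sum_c Q_c K_c^T over D-slices.
    s = [[0.0] * n_k for _ in range(n_q)]
    for lo in range(0, d, d_chunk):
        hi = min(lo + d_chunk, d)
        partial = matmul(col_slice(q, lo, hi), transpose(col_slice(k, lo, hi)))
        for i in range(n_q):
            for j in range(n_k):
                s[i][j] += partial[i][j]

    p = softmax_rows([[x * scale for x in row] for row in s])

    # Stage 2: each output slice O_c = P V_c is independent of the others.
    out = [[] for _ in range(n_q)]
    for lo in range(0, len(v[0]), d_chunk):
        hi = min(lo + d_chunk, len(v[0]))
        o_slice = matmul(p, col_slice(v, lo, hi))
        for i in range(n_q):
            out[i].extend(o_slice[i])
    return out


if __name__ == "__main__":
    random.seed(0)
    rand = lambda n, d: [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
    q, k, v = rand(4, 12), rand(6, 12), rand(6, 12)
    ref = attention_reference(q, k, v)
    out = attention_split_d(q, k, v, d_chunk=5)
    print(max(abs(a - b) for ra, rb in zip(out, ref) for a, b in zip(ra, rb)))
```

In this sketch, each loop iteration reads only d_chunk columns of Q, K, or V, which mirrors why the working set per step can scale with D_chunk rather than D. The chunk size need not divide D; the last slice is simply narrower.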

Cite This Study

DefTruth et al. (Wed,) studied this question.

synapsesocial.com/papers/6a2cf6aefaef96ed7f05864d https://doi.org/10.5281/zenodo.20638547
