Attention mechanisms with large head dimensions (D > 256) are increasingly common in large multimodal models (e.g., Gemma4-31B), yet standard FlashAttention kernels, which are optimized for D ≤ 256, suffer from shared-memory exhaustion and register pressure once D exceeds 256. We present FFPA (Faster Flash Prefill Attention), which introduces Split-D, a head-dimension chunking strategy that decomposes QK^T and PV into sub-operations over D-slices. Split-D reduces SRAM complexity from O(B_c × D) to O(B_c × D_chunk) ≈ O(1) with respect to D, and keeps active register pressure bounded at O(D_chunk). The forward pass uses Split-D universally with no precision issues, making it suitable for both training and inference on any GPU. For the backward pass, two hardware-driven strategies are provided: an fp32 gradient buffer for compute-limited SM ≤ 89 GPUs, and a Hopper-specialized CuTe DSL kernel with full-D WGMMA forward and recompute-based backward for SM ≥ 90 GPUs. Across H200, H800, H20, RTX 5090, and L20, FFPA achieves a 1.5–6.1× speedup over PyTorch SDPA for D ∈ [320, 1024], reaching 426 TFLOPS on H200, and 1.4–1.5× higher end-to-end training throughput on Gemma4-31B. Code is available at https://github.com/xlite-dev/ffpa-attn.
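The Split-D decomposition can be illustrated with a minimal pure-Python sketch. It is not the FFPA kernel: the function names `split_d_attention` and `reference_attention` and the parameter `d_chunk` are hypothetical, and the sketch leaves out the tiling over sequence blocks and the online softmax that a FlashAttention-style kernel would use. It only shows that QK^T can be accumulated slice by slice along D, and that PV can be produced one D-slice at a time, so that only a D_chunk-wide slice of Q, K, or V needs to be live at once.

```python
import math
import random


def split_d_attention(Q, K, V, d_chunk):
    """Single-head attention with QK^T and PV evaluated over head-dim slices.

    Q is n_q x D, K and V are n_k x D (lists of lists). Illustrative only.
    """
    n_q, n_k, d = len(Q), len(K), len(Q[0])

    # Stage 1: accumulate S = Q K^T one D-slice at a time.
    # Each pass touches only columns [start, stop) of Q and K.
    S = [[0.0] * n_k for _ in range(n_q)]
    for start in range(0, d, d_chunk):
        stop = min(start + d_chunk, d)
        for i in range(n_q):
            for j in range(n_k):
                S[i][j] += sum(Q[i][t] * K[j][t] for t in range(start, stop))

    # Stage 2: row-wise softmax of the scaled scores (independent of D).
    scale = 1.0 / math.sqrt(d)
    P = []
    for row in S:
        m = max(x * scale for x in row)
        e = [math.exp(x * scale - m) for x in row]
        z = sum(e)
        P.append([x / z for x in e])

    # Stage 3: O = P V, writing one D-slice of the output at a time.
    # Each pass touches only columns [start, stop) of V.
    O = [[0.0] * d for _ in range(n_q)]
    for start in range(0, d, d_chunk):
        stop = min(start + d_chunk, d)
        for i in range(n_q):
            for t in range(start, stop):
                O[i][t] = sum(P[i][j] * V[j][t] for j in range(n_k))
    return O


def reference_attention(Q, K, V):
    """Unchunked softmax(Q K^T / sqrt(D)) V, used only as a correctness check."""
    d = len(Q[0])
    out = []
    for q in Q:
        s = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(s)
        e = [math.exp(x - m) for x in s]
        z = sum(e)
        out.append([sum(e[j] / z * V[j][t] for j in range(len(K))) for t in range(d)])
    return out


if __name__ == "__main__":
    random.seed(0)

    def rand(n, d):
        return [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]

    Q, K, V = rand(4, 320), rand(6, 320), rand(6, 320)
    out = split_d_attention(Q, K, V, d_chunk=64)
    ref = reference_attention(Q, K, V)
    print(max(abs(a - b) for ro, rr in zip(out, ref) for a, b in zip(ro, rr)))
```

In this sketch the score matrix is the only state carried across D-slices, and its size does not depend on D, which is the property the O(B_c × D_chunk) bound relies on. The chunk width of 64 in the demo is an arbitrary choice for illustration, not a value taken from the paper.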
DefTruth et al. (Wed,) studied this question.