Long-context large language models incur substantial computational overhead during autoregressive decoding. Existing sparse attention methods can improve inference efficiency, but they typically rely on fixed sparse patterns, historical attention statistics, or coarse-grained proxy representations to estimate important KV positions, making it difficult to accurately capture query-dependent fine-grained relevance for dynamic KV retrieval. In this paper, we propose Block-then-Hash Attention (BTHA), a two-stage KV retrieval method: it first performs block-level routing with mean key representations to rapidly reduce the candidate search space, and then applies a learnable orthogonal hash network within the routed KV candidates for fine-grained token-level position retrieval. The hash network is trained offline to learn the hash mapping between queries and keys, with a low training cost: on Llama-3.1-8B-Instruct, training can be completed in approximately two hours using a single NVIDIA A100 GPU. During inference, BTHA implements block-level routing, hash-based retrieval, and sparse attention computation with dedicated operators, and further employs CPU–GPU collaborative scheduling to reduce memory access, synchronization, and candidate selection overhead, thereby achieving end-to-end decoding acceleration. Extensive experiments on LongBench-E show that BTHA consistently outperforms state-of-the-art top-K attention methods in both accuracy and efficiency; under a 512-position budget, it achieves the best average accuracy on both Llama-2-7B-32K-Instruct and Llama-3.1-8B-Instruct, while delivering up to 7.0× speedup over vanilla full attention.
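To make the two-stage retrieval concrete, the following is a minimal pure-Python sketch of the decode-time flow described above: block-level routing with mean keys, hash-based token selection within the routed candidates, and attention over the selected positions. It is an illustration under stated assumptions, not the paper's implementation. All function and parameter names (`route_blocks`, `hash_retrieve`, `sparse_attention`, `btha_sketch`, `block_size`, `num_blocks`, `budget`) are hypothetical. The learned orthogonal hash network is replaced here by a fixed sign projection `W` compared by Hamming distance, and the dedicated operators and CPU–GPU scheduling are omitted entirely.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hash_code(W, x):
    # Stand-in for the learned hash network (assumption): one sign bit per row of W.
    return [1 if dot(row, x) >= 0 else 0 for row in W]

def route_blocks(query, keys, block_size, num_blocks):
    # Stage 1: score each block by the dot product of the query with its mean key,
    # then keep the token positions of the highest-scoring blocks.
    scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scores.append((dot(query, mean_key), start))
    top = sorted(scores, reverse=True)[:num_blocks]
    return sorted(
        pos
        for _, start in top
        for pos in range(start, min(start + block_size, len(keys)))
    )

def hash_retrieve(query, keys, candidates, W, budget):
    # Stage 2: among routed candidates, keep the `budget` positions whose hash
    # codes are closest to the query's code in Hamming distance.
    q_code = hash_code(W, query)

    def hamming(pos):
        return sum(a != b for a, b in zip(q_code, hash_code(W, keys[pos])))

    return sorted(sorted(candidates, key=hamming)[:budget])

def sparse_attention(query, keys, values, positions):
    # Softmax attention restricted to the retrieved positions.
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[p]) * scale for p in positions]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    return [
        sum(w * values[p][j] for w, p in zip(weights, positions)) / total
        for j in range(len(values[0]))
    ]

def btha_sketch(query, keys, values, W, block_size=4, num_blocks=2, budget=4):
    candidates = route_blocks(query, keys, block_size, num_blocks)
    positions = hash_retrieve(query, keys, candidates, W, budget)
    return positions, sparse_attention(query, keys, values, positions)

# Toy usage with random data; sizes are illustrative only.
rng = random.Random(0)
dim, num_tokens, num_bits = 8, 32, 16
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
W = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]
query = [rng.gauss(0, 1) for _ in range(dim)]
positions, output = btha_sketch(query, keys, values, W)
```

In this sketch, the coarse stage touches only one mean vector per block, and the fine stage compares short binary codes rather than full-precision keys, which is the intuition behind reducing the candidate search space before token-level retrieval. In a real system the key codes would be computed once and cached rather than recomputed per query as done here for brevity.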