What question did this study set out to answer?

This research aims to enhance the efficiency and accuracy of long-context language models during decoding.

June 17, 2026 | Open Access

BTHA: Block-Then-Hash Attention for Efficient Long Context

Key Points

The authors propose Block-Then-Hash Attention (BTHA), a two-stage key-value (KV) retrieval method.
BTHA first routes queries at the block level, then applies a learnable hash network for dynamic, token-level key retrieval.
Experiments on LongBench-E compare BTHA with other attention methods.
Under a 512-position budget, BTHA achieves the best average accuracy on both Llama-2-7B-32K-Instruct and Llama-3.1-8B-Instruct.
BTHA delivers up to 7.0× speedup over vanilla full attention.
BTHA outperforms state-of-the-art top-K attention methods in both accuracy and efficiency.

Abstract

Long-context large language models incur substantial computational overhead during autoregressive decoding. Existing sparse attention methods can improve inference efficiency, but they typically rely on fixed sparse patterns, historical attention statistics, or coarse-grained proxy representations to estimate important KV positions, making it difficult to accurately capture query-dependent fine-grained relevance for dynamic KV retrieval. In this paper, we propose Block-then-Hash Attention (BTHA), a two-stage KV retrieval method: it first performs block-level routing with mean key representations to rapidly reduce the candidate search space, and then applies a learnable orthogonal hash network within the routed KV candidates for fine-grained token-level position retrieval. The hash network is trained offline to learn the hash mapping between queries and keys, with a low training cost: on Llama-3.1-8B-Instruct, training can be completed in approximately two hours using a single NVIDIA A100 GPU. During inference, BTHA implements block-level routing, hash-based retrieval, and sparse attention computation with dedicated operators, and further employs CPU–GPU collaborative scheduling to reduce memory access, synchronization, and candidate selection overhead, thereby achieving end-to-end decoding acceleration. Extensive experiments on LongBench-E show that BTHA consistently outperforms state-of-the-art top-K attention methods in both accuracy and efficiency; under a 512-position budget, it achieves the best average accuracy on both Llama-2-7B-32K-Instruct and Llama-3.1-8B-Instruct, while delivering up to 7.0× speedup over vanilla full attention.
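
The two-stage retrieval described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, block size, number of routed blocks, and code length are hypothetical, and a fixed random orthogonal projection with sign binarization stands in for the learned orthogonal hash network, which the abstract does not specify in detail. Ranking candidates by Hamming distance between query and key codes is likewise an assumption.

```python
import numpy as np

def block_then_hash_select(q, K, block_size=16, top_blocks=4,
                           budget=8, n_bits=32, seed=0):
    """Pick at most `budget` key positions for query q in two stages.

    All parameter names and defaults are illustrative assumptions.
    """
    n, d = K.shape
    rng = np.random.default_rng(seed)

    # Stage 1: block-level routing. Score each block by the dot product
    # between the query and the mean of the keys in that block.
    n_blocks = (n + block_size - 1) // block_size
    block_means = np.stack([
        K[b * block_size:(b + 1) * block_size].mean(axis=0)
        for b in range(n_blocks)
    ])
    kept = np.argsort(-(block_means @ q))[:top_blocks]
    candidates = np.concatenate([
        np.arange(b * block_size, min((b + 1) * block_size, n))
        for b in kept
    ])

    # Stage 2: token-level hashing inside the routed candidates.
    # ASSUMPTION: a random orthogonal projection replaces the learned
    # hash network; it requires n_bits <= d.
    W, _ = np.linalg.qr(rng.standard_normal((d, d)))
    W = W[:, :n_bits]
    q_code = (q @ W) > 0
    k_codes = (K[candidates] @ W) > 0
    hamming = np.count_nonzero(k_codes != q_code, axis=1)
    order = np.argsort(hamming, kind="stable")[:budget]
    return np.sort(candidates[order])

def sparse_attention(q, K, V, idx):
    """Softmax attention restricted to the selected positions."""
    scores = K[idx] @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V[idx]

# Toy usage: 256 cached positions, head dimension 64.
rng = np.random.default_rng(1)
K = rng.standard_normal((256, 64))
V = rng.standard_normal((256, 64))
q = rng.standard_normal(64)
idx = block_then_hash_select(q, K)
out = sparse_attention(q, K, V, idx)
```

In this sketch the position budget plays the role of the 512-position budget mentioned in the abstract, scaled down for a toy cache. When every block is routed and the budget covers all positions, the selection degenerates to full attention, which gives a simple sanity check.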

Cite This Study

Liu et al. studied this question.

synapsesocial.com/papers/6a323e9ed50b63ecad207c4a https://doi.org/10.3390/electronics15122635