This research proposes a hybrid approach that combines linear attention, chunking, and recurrent mechanisms to address the efficiency limitations of Large Language Models (LLMs) built on the traditional transformer framework. Our approach integrates three key innovations: (1) linear attention, which uses kernel function mapping to reduce time and space complexity from O(n²) to O(n); (2) dynamic chunk-based processing, which compresses the KV cache by a factor of 5 using mean pooling; and (3) three token-filtering strategies (hard thresholding, adaptive gating, and hierarchical chunking) that filter tokens and reduce the processing load. The results show that the approach improves LLM efficiency while performing well across several evaluation tools. Experiments demonstrate that our 3.2B-parameter model achieves strong performance on multiple benchmarks, outperforming dense models of similar scale and even matching the performance of larger models on certain tasks. Together, these results provide a theoretically grounded and empirically validated framework for efficient LLM optimization.
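To make the first two ideas concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a feature map of the form elu(x) + 1 (the abstract only says "kernel function mapping", so this choice is illustrative), a non-causal setting, and a fixed chunk size of 5 standing in for the dynamic chunking described above. The names `feature_map`, `linear_attention`, and `mean_pool_chunks` are hypothetical.

```python
import math


def feature_map(x):
    # Assumed kernel feature map: elu(x) + 1, which keeps every entry positive.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(queries, keys, values):
    """Non-causal kernelized attention with cost linear in sequence length.

    Instead of forming the n x n score matrix, accumulate
    S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j) once, then reuse
    them for every query.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    kv_sum = [[0.0] * d_v for _ in range(d_k)]  # S, shape d_k x d_v
    k_sum = [0.0] * d_k                         # z, shape d_k
    for k, v in zip(keys, values):
        fk = feature_map(k)
        for a in range(d_k):
            k_sum[a] += fk[a]
            for b in range(d_v):
                kv_sum[a][b] += fk[a] * v[b]

    outputs = []
    for q in queries:
        fq = feature_map(q)
        norm = sum(fq[a] * k_sum[a] for a in range(d_k))
        outputs.append(
            [sum(fq[a] * kv_sum[a][b] for a in range(d_k)) / norm
             for b in range(d_v)]
        )
    return outputs


def mean_pool_chunks(rows, chunk_size=5):
    """Compress cached key or value rows by averaging each chunk of rows.

    A fixed chunk_size of 5 is an assumption that mirrors the 5x
    compression figure; a trailing partial chunk is averaged as-is.
    """
    pooled = []
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        pooled.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return pooled


# Ten cached rows shrink to two pooled rows.
cache = [[float(i), 2.0 * i] for i in range(10)]
print(mean_pool_chunks(cache))  # [[2.0, 4.0], [7.0, 14.0]]
```

The linear cost comes from reordering the computation: the key-value summary is built in one pass over the sequence and its size depends only on the head dimensions, not on n. The pooled cache could then be passed to `linear_attention` as `keys` and `values` in place of the full cache, trading some resolution for memory.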
Zhang et al. (Thu,) studied this question.