Mamba2, one of the most promising variants of state space models (SSMs), has shown remarkable performance in various domains. However, accelerating Mamba2 on existing hardware architectures remains challenging because element-wise (EW) operations are processed inefficiently. In this work, we propose PMSCA, which uses parallel computing arrays to accelerate sparse Mamba2 models. First, we use a dynamic bit-width scaling strategy and a weight pruning method to significantly reduce memory overhead; sparse computing based on the pruned weights also greatly improves the throughput of matrix multiplication. Second, we propose a hybrid layer-wise N:M structured pruning method to reduce the accuracy loss caused by weight pruning. Third, we propose a unified multi-branch SSD-PPE architecture, combining structured state space duality (SSD) with post-processing elements (PPE), to improve the computing efficiency of element-wise operations, so that element-wise operations and matrix multiplications within SSD are computed in parallel. Finally, we propose a mapping and parallel hardware scheduling strategy to balance the workload and further improve efficiency. Compared with the Intel i7-14700K CPU and NVIDIA RTX 4090 GPU, our design achieves 68-192×/3-74× speedup of SSD, 23-53×/0.7-12× speedup of Mamba2, and 1655-3618×/44-816× higher energy efficiency, respectively.
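To make the N:M structured pruning mentioned above concrete, the following is a minimal sketch of generic magnitude-based N:M pruning, in which only the N largest-magnitude weights are kept in every group of M consecutive weights. It is not the paper's hybrid layer-wise method, whose selection criteria are not described in the abstract. The function name `nm_prune`, the layer names, and the per-layer (N, M) choices in `layer_patterns` are all hypothetical and serve only to show how different layers could use different patterns.

```python
import numpy as np


def nm_prune(weights: np.ndarray, n: int, m: int) -> np.ndarray:
    """Keep the n largest-magnitude values in each group of m consecutive
    weights along the last axis; zero out the rest (generic N:M pruning)."""
    if not 0 < n <= m:
        raise ValueError("require 0 < n <= m")
    if weights.shape[-1] % m != 0:
        raise ValueError("last dimension must be divisible by m")

    groups = weights.reshape(-1, m)
    # Indices of the (m - n) smallest-magnitude entries in each group.
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]
    mask = np.ones(groups.shape, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return np.where(mask, groups, 0.0).reshape(weights.shape)


# Hypothetical per-layer patterns: an assumption for illustration only,
# not the configuration used in the paper.
layer_patterns = {"in_proj": (2, 4), "out_proj": (1, 4)}

rng = np.random.default_rng(0)
layers = {name: rng.standard_normal((8, 16)) for name in layer_patterns}
pruned_layers = {
    name: nm_prune(w, *layer_patterns[name]) for name, w in layers.items()
}

for name, w in pruned_layers.items():
    density = np.count_nonzero(w) / w.size
    print(f"{name}: pattern {layer_patterns[name]}, density {density:.2f}")
```

Because every group of M weights holds at most N nonzeros, the nonzero positions can be stored as small fixed-width indices, which is what lets sparse matrix hardware skip the pruned multiplications in a regular, predictable way.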
Zheng et al. (Thu,) studied this question.