What question did this study set out to answer?

The aim is to enhance the processing efficiency of Mamba2 models via parallel computing strategies.

June 14, 2026 · Open Access

PMSCA: Parallel MAC and SSD Computation Arrays for the Acceleration of Sparse Mamba2 Models

Key Points

Proposed PMSCA, which uses dynamic bit-width scaling and weight pruning to reduce memory overhead.
Introduced a hybrid layer-wise N:M structured pruning method to reduce the accuracy loss caused by weight pruning.
Developed a unified multi-branch SSD post-processing element (PPE) architecture to improve the efficiency of element-wise operations.
Achieved 68-192× (SSD) and 23-53× (Mamba2) speedup over the Intel i7-14700k CPU, and 3-74× (SSD) and 0.7-12× (Mamba2) speedup over the Nvidia RTX-4090 GPU.
Demonstrated 1655-3618× higher energy efficiency than the Intel i7-14700k CPU and 44-816× higher than the Nvidia RTX-4090 GPU.

Abstract

Mamba2, one of the most promising variants of state space models (SSMs), has shown remarkable performance in various domains. However, accelerating Mamba2 on existing hardware architectures remains challenging because they process element-wise (EW) operations inefficiently. In this work, we propose PMSCA, which uses parallel computing arrays to accelerate sparse Mamba2 models. First, we use a dynamic bit-width scaling strategy and a weight pruning method to significantly reduce memory overhead; sparse computing based on weight pruning also greatly improves the throughput of matrix multiplication. Second, we propose a hybrid layer-wise N:M structured pruning method to reduce the accuracy loss of weight pruning. Third, we propose a unified multi-branch structured state space duality (SSD) post-processing element (PPE) architecture to improve the computing efficiency of EW operations, achieving parallel computation of EW operations and matrix multiplications within SSD. Finally, we propose a mapping and parallel hardware scheduling strategy to balance the workload and further improve efficiency. Compared with the Intel i7-14700k CPU and the Nvidia RTX-4090 GPU, our design achieves 68-192×/3-74× speedup on SSD, 23-53×/0.7-12× speedup on Mamba2, and 1655-3618×/44-816× higher energy efficiency, respectively.
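
For readers unfamiliar with N:M structured pruning, the following is a minimal sketch of the basic rule: within every group of M consecutive weights, keep the N with the largest magnitude and zero the rest. It is not the paper's hybrid layer-wise method, whose per-layer choice of N and M is not described in the abstract. The function name `prune_n_m`, the 2:4 ratio, and the sample weights are assumptions made for illustration only.

```python
from typing import List


def prune_n_m(weights: List[float], n: int, m: int) -> List[float]:
    """Keep the n largest-magnitude weights in each group of m; zero the others.

    Hypothetical helper for illustration; not taken from the PMSCA paper.
    """
    if not 0 < n <= m:
        raise ValueError("require 0 < n <= m")
    if len(weights) % m != 0:
        raise ValueError("len(weights) must be a multiple of m")

    pruned: List[float] = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Rank positions by descending magnitude; the sort is stable,
        # so ties are resolved in favour of the earlier position.
        ranked = sorted(range(m), key=lambda i: -abs(group[i]))
        keep = set(ranked[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


# Assumed example: one weight row pruned with a 2:4 pattern.
row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
print(prune_n_m(row, n=2, m=4))
# → [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.25, 0.0]
```

Because every group retains exactly N non-zero values, hardware can store only those values plus their positions within the group, which is the general reason structured sparsity maps well onto fixed-size compute arrays.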

Cite This Study

Zheng et al. studied this question.

synapsesocial.com/papers/6a2e4524b1cc60ccdea8a6cf https://doi.org/10.1587/elex.23.20260188
