Transformer-based large language models (LLMs) have achieved remarkable success. However, their growing size presents challenges due to the increasing mismatch between model scale and hardware capacity. Model compression techniques have been proposed to address this issue, but existing methods often struggle to cope with the large dynamic range of model parameters, including both activations and weights. Additionally, current outlier-aware encoding schemes introduce complex logic, which limits their effectiveness in both compression ratio and hardware efficiency. To overcome these challenges, we propose SPADE, a co-designed algorithm–architecture solution for post-training quantized INT8 large language models, introducing a hybrid-precision encoding scheme. SPADE employs variable-length data representation guided by metadata, enabling local, value-dependent encoding with only 3.66% hardware area overhead while delivering up to 4.88× speedup and 70.5% energy reduction compared with state-of-the-art encoding-based accelerators. Our key insight is that only a small portion of activations are outliers that require high precision, such as INT8, while the vast majority can be accurately represented with INT4. This allows us to exploit the sparsity of high-significance bits across the data. By assigning an identifier bit to each data element, SPADE dynamically selects between compact INT4 encoding for normal values and higher-precision INT8 encoding for rare outliers, eliminating unnecessary bit width and reducing data redundancy. Moreover, we analyze the limitations of existing variable-length encoding methods, which typically require serialization and deserialization due to uncertainty in the precision format of loaded data blocks. SPADE integrates precision information directly into its encoding rules, enabling parallel decoding and computation on systolic arrays.
This design significantly boosts computational throughput and helps realize the theoretical performance advantages of structured encoding. We evaluate SPADE-based accelerators against state-of-the-art encoding-based accelerators, and our results demonstrate that the SPADE-based accelerator achieves up to 4.88× speedup and 70.5% energy reduction, while maintaining superior model accuracy. The code is available at https://github.com/jlsbz/SPADE.
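The identifier-bit idea described above can be illustrated with a minimal Python sketch. It is not the SPADE format itself: the function names (`encode`, `decode`, `encoded_bits`), the nibble layout, and the sequential decoder are assumptions made for illustration only. Each INT8 value that fits in the signed INT4 range is stored as one 4-bit payload with flag 0; any other value keeps its full 8 bits with flag 1.

```python
from typing import List, Tuple

# Signed 4-bit range; values outside it are treated as outliers (assumption).
INT4_MIN, INT4_MAX = -8, 7


def encode(values: List[int]) -> Tuple[List[int], List[int]]:
    """Return (flags, nibbles): one identifier bit per value plus 4-bit payloads."""
    flags: List[int] = []
    nibbles: List[int] = []
    for v in values:
        if not -128 <= v <= 127:
            raise ValueError(f"{v} is not a valid INT8 value")
        u = v & 0xFF  # two's-complement byte
        if INT4_MIN <= v <= INT4_MAX:
            flags.append(0)              # normal value: keep only the low nibble
            nibbles.append(u & 0xF)
        else:
            flags.append(1)              # outlier: keep high nibble, then low nibble
            nibbles.extend([u >> 4, u & 0xF])
    return flags, nibbles


def decode(flags: List[int], nibbles: List[int]) -> List[int]:
    """Rebuild the INT8 values; the flags alone determine each payload width."""
    out: List[int] = []
    pos = 0
    for f in flags:
        if f:
            u = (nibbles[pos] << 4) | nibbles[pos + 1]
            out.append(u - 256 if u >= 128 else u)  # sign-extend 8 bits
            pos += 2
        else:
            u = nibbles[pos]
            out.append(u - 16 if u >= 8 else u)     # sign-extend 4 bits
            pos += 1
    return out


def encoded_bits(flags: List[int], nibbles: List[int]) -> int:
    """Total storage: one identifier bit per element plus 4 bits per nibble."""
    return len(flags) + 4 * len(nibbles)


values = [3, -2, 100, 0, -8, 7, -128, 5]
flags, nibbles = encode(values)
restored = decode(flags, nibbles)
```

In this toy input, two of the eight values are outliers, so the encoded form takes 48 bits against 64 bits for plain INT8. The decoder here walks the stream sequentially for clarity; it does not model the parallel decoding on systolic arrays that the abstract describes.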
Yang et al. (Mon,) studied this question.