Quantization has emerged as an effective technique for reducing the memory requirements of Large Language Models (LLMs). However, we observe that activation smoothing destroys the originally flat distribution of the weights, and that existing methods struggle to smooth activation outliers of extremely large magnitude; both effects increase quantization error. To overcome these challenges, we propose CHAMP-Q, a novel quantization method built on two strategies. (1) To mitigate the impact of activation smoothing on the weights, a multi-dimensional feature-aware channel permutation (MFCP) strategy permutes similar channels to adjacent positions in the weight matrix, which lowers intra-group weight variance and thereby reduces the group-wise weight quantization error. (2) To reduce the quantization error caused by activation outliers, a hybrid numerical smoothing (HNS) strategy suppresses outliers by selectively applying different smoothing strategies according to their magnitudes. Furthermore, we implement a W4A8 quantization framework. Experimental results demonstrate that CHAMP-Q enables 4-bit weight and 8-bit activation quantization with less accuracy degradation than existing outlier smoothing methods, and achieves up to 1.74× memory savings and a 1.45× inference speedup over the original model.
Guan et al. studied this question.
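
The intuition behind channel permutation can be illustrated with a small NumPy sketch. This is not the MFCP algorithm itself: the abstract does not specify its features or grouping, so the sketch assumes a single stand-in feature (each input channel's maximum absolute weight) and symmetric 4-bit group-wise quantization along the input-channel axis. The function names `quantize_groupwise` and `channel_permutation`, the matrix sizes, and the synthetic smoothing factor are all hypothetical choices made for illustration.

```python
import numpy as np

def quantize_groupwise(w, group_size=16, bits=4):
    """Symmetric fake quantization with one scale per (row, group of input channels)."""
    out_f, in_f = w.shape
    qmax = 2 ** (bits - 1) - 1
    g = w.reshape(out_f, in_f // group_size, group_size)
    scale = np.abs(g).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero groups
    q = np.clip(np.round(g / scale), -qmax - 1, qmax)
    return (q * scale).reshape(out_f, in_f)

def channel_permutation(w):
    """Assumed stand-in feature: order input channels by their max absolute weight."""
    return np.argsort(np.abs(w).max(axis=0))

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 256))

# Emulate the effect of activation smoothing on the weights: a few input
# channels are scaled up, here one per quantization group of 16.
smooth = np.ones(256)
smooth[::16] = 8.0
w = w * smooth

perm = channel_permutation(w)
w_perm = w[:, perm]

err_plain = np.mean((quantize_groupwise(w) - w) ** 2)
err_perm = np.mean((quantize_groupwise(w_perm) - w_perm) ** 2)
print("MSE without permutation:", err_plain)
print("MSE with permutation:   ", err_perm)

# Applying the same permutation to the activations leaves the layer output unchanged.
x = rng.normal(size=(8, 256))
same_output = np.allclose(x @ w.T, x[:, perm] @ w_perm.T)
```

In this toy setting the scaled-up channels end up in one group after sorting, so they no longer inflate the quantization scale of every other group. A real implementation would also have to fold the permutation into the adjacent layers or kernels, which the sketch does not attempt.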