March 24, 2024 · Open Access

Norm Tweaking: High-Performance Low-Bit Quantization of Large Language Models

Abstract

As large language models (LLMs) continue to grow, compressing them without sacrificing accuracy has become a crucial challenge for deployment. While some quantization methods, such as GPTQ, achieve acceptable accuracy with 4-bit weight-only quantization, attempts at lower bit widths often result in severe performance degradation. In this paper, we introduce norm tweaking, a technique that can be used as a plugin for existing post-training quantization (PTQ) methods to achieve high precision cost-efficiently. Our approach is inspired by the observation that rectifying the quantized activation distribution to match its floating-point counterpart can readily restore accuracy for LLMs. To achieve this, we design a tweaking strategy that combines calibration data generation with a channel-wise distance constraint to update the weights of the normalization layers for better generalization. We conduct extensive experiments on various datasets using several open-source LLMs. Our method yields significant improvements in both weight-only quantization and joint quantization of weights and activations, surpassing existing PTQ methods. On GLM-130B and OPT-66B, our method even achieves the same level of accuracy at 2-bit quantization as the floating-point models. This simple and effective approach makes low-bit quantization more practical for real-world applications.
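To make the core idea concrete, the following toy sketch quantizes a small linear layer to 2 bits and then adjusts only the preceding normalization gain so that the per-channel mean and variance of the quantized output move back toward the floating-point output. Everything here is an illustrative assumption and not the authors' implementation: the helper names (`quantize_rtn`, `channel_distance`, `tweak_gain`), the round-to-nearest quantizer, the exact form of the distance, the random calibration batch, and the greedy coordinate search, which stands in for whatever gradient-based update the paper actually uses.

```python
import random

def quantize_rtn(row, bits):
    """Symmetric round-to-nearest quantization of one weight row, returned dequantized."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in row) / qmax
    if scale == 0.0:
        return list(row)
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in row]

def layer_output(weight, gain, batch):
    """y = W @ (gain * x) per sample; gain plays the role of the norm layer weight."""
    return [[sum(w * g * x for w, g, x in zip(row, gain, sample)) for row in weight]
            for sample in batch]

def channel_distance(out_a, out_b):
    """Average over output channels of |mean gap| + |variance gap| (assumed form)."""
    n_ch = len(out_a[0])
    total = 0.0
    for c in range(n_ch):
        col_a = [y[c] for y in out_a]
        col_b = [y[c] for y in out_b]
        mean_a = sum(col_a) / len(col_a)
        mean_b = sum(col_b) / len(col_b)
        var_a = sum((v - mean_a) ** 2 for v in col_a) / len(col_a)
        var_b = sum((v - mean_b) ** 2 for v in col_b) / len(col_b)
        total += abs(mean_a - mean_b) + abs(var_a - var_b)
    return total / n_ch

def tweak_gain(weight_q, gain, batch, target, step=0.05, rounds=20):
    """Greedy coordinate search over the gain; a move is kept only if the distance drops."""
    gain = list(gain)
    best = channel_distance(layer_output(weight_q, gain, batch), target)
    for _ in range(rounds):
        for i in range(len(gain)):
            for delta in (step, -step):
                trial = gain[:i] + [gain[i] + delta] + gain[i + 1:]
                loss = channel_distance(layer_output(weight_q, trial, batch), target)
                if loss < best:
                    gain, best = trial, loss
    return gain, best

rng = random.Random(0)
weight = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]   # toy float weights
batch = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(64)]   # stand-in calibration data
gain = [1.0] * 4

weight_q = [quantize_rtn(row, bits=2) for row in weight]           # quantized weights stay frozen
target = layer_output(weight, gain, batch)                         # float reference activations
before = channel_distance(layer_output(weight_q, gain, batch), target)
tweaked_gain, after = tweak_gain(weight_q, gain, batch, target)
print(f"distance before: {before:.4f}, after: {after:.4f}")
```

The point of the sketch is the division of labor: the quantized weights are never touched, and only the small normalization gain vector is updated, which is what keeps this kind of correction cheap compared with retraining the model.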

Cite This Study

Li et al. (2024) studied this question.

synapsesocial.com/papers/68e72962b6db6435876a30e2 https://doi.org/10.1609/aaai.v38i17.29815
