What does this research mean for the field?

When plugged into the state-of-the-art binarization methods BiLLM and ARB-LLM, PBBQ lowers WikiText-2 perplexity by 21.46% and 22.02%, respectively. Novelty: novel finding. Consensus alignment: neutral.

What question did this study set out to answer?

The study set out to develop a post-training quantization technique that reduces model size and inference energy consumption at extremely low bit widths without compromising performance.

February 16, 2026 | Open Access

PBBQ: Plug-In Balanced Binary Quantization for LLMs

Key Points

Sought a post-training quantization technique that reduces model size and inference energy consumption without compromising performance.
Proposed Plug-in Balanced Binary Quantization (PBBQ) to counter calibration-data overfitting in error-compensation-based quantizers at extremely low bit widths.
Implemented block-wise dropout and layer-wise reordering to reduce the excessive emphasis on subsequent channels.
Integrated PBBQ into GPTQ-style frameworks and into the ultra-low-bit methods BiLLM and ARB-LLM.
Evaluated performance using perplexity (ppl), including on WikiText-2.
Combined with BiLLM, reduced WikiText-2 perplexity by 21.46%, from 32.48 to 25.51.
Combined with ARB-LLM, reduced WikiText-2 perplexity by 22.02%, from 16.44 to 12.82.
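Both percentages are relative reductions, (before - after) / before, and they can be checked directly against the quoted perplexities. A quick verification in Python (the helper name relative_reduction is ours, not the paper's):

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`, relative to `before`."""
    return (before - after) / before * 100.0


# WikiText-2 perplexities quoted in the abstract: (without PBBQ, with PBBQ).
reported = {
    "BiLLM": (32.48, 25.51),
    "ARB-LLM": (16.44, 12.82),
}

for method, (before, after) in reported.items():
    pct = relative_reduction(before, after)
    print(f"{method} + PBBQ: {before} -> {after} ({pct:.2f}% lower perplexity)")
```

Rounded to two decimals, this reproduces the reported 21.46% and 22.02%.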

Abstract

In recent years, the growing parameter counts of large models have substantially increased storage and inference overhead. Consequently, post-training quantization has become a key technique for reducing model size and inference-time energy consumption. However, we observe that, under extremely low bit-width settings, mainstream error-compensation-based algorithms tend to overfit the calibration data. To mitigate this issue, we propose Plug-in Balanced Binary Quantization for LLMs (PBBQ), which reduces the excessive emphasis on subsequent channels via block-wise dropout and layer-wise reordering. PBBQ can be integrated into GPTQ-style frameworks and ultra-low-bit methods such as BiLLM and ARB-LLM. Experimental results show that PBBQ significantly improves the performance of multiple error-compensation quantization algorithms. When combined with the state-of-the-art methods BiLLM and ARB-LLM, PBBQ reduces the perplexity (ppl) on WikiText-2 by 21.46% (from 32.48 to 25.51) and 22.02% (from 16.44 to 12.82), respectively.
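The abstract traces the overfitting to error compensation: when a GPTQ-style quantizer rounds one input channel, it folds the resulting error into the channels that have not been quantized yet. The Python sketch below illustrates only that baseline mechanism, not PBBQ itself; the summary does not specify how block-wise dropout or layer-wise reordering work, so neither is reproduced. Assumptions, all ours: weights are binarized as alpha_i * sign(w) with a fixed per-row scale alpha_i = mean(|W_i|) (BiLLM and ARB-LLM use more elaborate binarizers), the compensation uses the explicit inverse-Hessian (OBS) update rather than GPTQ's Cholesky and lazy-batch formulation, and the damping of 1% of the mean Hessian diagonal is an illustrative choice. Every function name is hypothetical.

```python
import random

# Illustrative sketch of GPTQ-style error compensation for binarization.
# NOT the PBBQ algorithm: no block-wise dropout, no layer-wise reordering.


def invert(m):
    """Gauss-Jordan inverse with partial pivoting (small, well-conditioned matrices)."""
    n = len(m)
    a = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        for r in range(n):
            if r != col and a[r][col] != 0.0:
                f = a[r][col]
                a[r] = [vr - f * vc for vr, vc in zip(a[r], a[col])]
    return [row[n:] for row in a]


def binarize(values, scales):
    """Elementwise sign binarization: q = scale * sign(w), with sign(0) = +1."""
    return [s if w >= 0 else -s for w, s in zip(values, scales)]


def row_scales(W):
    """Assumption: per-row scale alpha_i = mean(|W_i|), fixed before quantization."""
    return [sum(abs(v) for v in r) / len(r) for r in W]


def gptq_binarize(W, X, damp=0.01):
    """Binarize W (rows x cols) channel by channel, pushing each channel's error
    onto later channels. X holds calibration inputs, one list per input channel."""
    rows, cols = len(W), len(W[0])
    W = [r[:] for r in W]  # work on a copy
    # Hessian of the layer-wise squared error, H = 2 X X^T, plus assumed damping.
    H = [[2.0 * sum(a * b for a, b in zip(X[i], X[j])) for j in range(cols)]
         for i in range(cols)]
    lam = damp * sum(H[i][i] for i in range(cols)) / cols
    for i in range(cols):
        H[i][i] += lam
    Hinv = invert(H)
    scales = row_scales(W)
    for j in range(cols):
        q = binarize([W[i][j] for i in range(rows)], scales)
        d = Hinv[j][j]
        for i in range(rows):
            err = (W[i][j] - q[i]) / d
            for k in range(j + 1, cols):
                W[i][k] -= err * Hinv[j][k]  # compensate on not-yet-quantized channels
            W[i][j] = q[i]
        # Remove channel j from the inverse Hessian of the remaining channels.
        for a in range(j + 1, cols):
            for b in range(j + 1, cols):
                Hinv[a][b] -= Hinv[a][j] * Hinv[j][b] / d
    return W


def naive_binarize(W):
    """Baseline without error compensation: binarize every weight independently."""
    return [binarize(r, [s] * len(r)) for r, s in zip(W, row_scales(W))]


def layer_error(W, Q, X):
    """Calibration loss ||W X - Q X||_F^2 that the compensation tries to keep small."""
    total = 0.0
    for wr, qr in zip(W, Q):
        for s in range(len(X[0])):
            diff = sum((w - q) * X[c][s] for c, (w, q) in enumerate(zip(wr, qr)))
            total += diff * diff
    return total


if __name__ == "__main__":
    random.seed(0)
    rows, cols, samples = 4, 8, 64
    W = [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]
    X = [[random.gauss(0, 1) for _ in range(samples)] for _ in range(cols)]
    print("naive binarization error:", round(layer_error(W, naive_binarize(W), X), 2))
    print("GPTQ-style binarization error:", round(layer_error(W, gptq_binarize(W, X), X), 2))
```

In the inner loop, each later channel k absorbs a share err * Hinv[j][k] of every earlier channel's error, so the last channels accumulate corrections fitted to the particular calibration inputs X. One plausible reading of the abstract's "excessive emphasis on subsequent channels" is this build-up. Comparing the two printed losses, and then re-measuring on held-out inputs, is a simple way to see how much of any gain is specific to the calibration set.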
