In recent years, the rapid growth in the parameter counts of large models has substantially increased storage and inference overhead. Consequently, post-training quantization has become a key technique for reducing model size and inference-time energy consumption. However, we observe that, at extremely low bit-widths, mainstream error-compensation-based algorithms tend to overfit the calibration data. To mitigate this issue, we propose Plug-in Balanced Binary Quantization for LLMs (PBBQ), which uses block-wise dropout and layer-wise reordering to reduce the excessive emphasis on subsequent channels. PBBQ can be integrated into GPTQ-style frameworks and into ultra-low-bit methods such as BiLLM and ARB-LLM. Experimental results show that PBBQ significantly improves the performance of multiple error-compensation quantization algorithms. When combined with the state-of-the-art methods BiLLM and ARB-LLM, it reduces perplexity (PPL) on WikiText-2 by 21.46% (from 32.48 to 25.51) and 22.02% (from 16.44 to 12.82), respectively.
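The relative reductions quoted above follow from the reported perplexity pairs. As a minimal sketch of that arithmetic (the helper name `relative_reduction` and the dictionary layout are illustrative assumptions, not part of PBBQ):

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the percentage drop from `before` to `after`."""
    return (before - after) / before * 100.0


# WikiText-2 perplexities quoted in the abstract: (baseline, baseline + PBBQ).
reported = {"BiLLM": (32.48, 25.51), "ARB-LLM": (16.44, 12.82)}

for method, (baseline, with_pbbq) in reported.items():
    drop = relative_reduction(baseline, with_pbbq)
    print(f"{method}: {drop:.2f}% lower perplexity")
```

Lower perplexity is better, so the reduction is measured relative to the baseline value rather than the improved one.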
Li et al. (Fri,) studied this question.