What question did this study set out to answer?

The study introduces QPP, a parametric compression method that shrinks large language models by reducing the number of parameters rather than only their bit precision.

July 1, 2026 | Open Access

QPP: Parametric Compression via Quantile Curves for Large Language Models

Key Points

Developed a quantile-curve framework that represents each weight matrix row with 32 anchors and block-shared ordering.
Implemented a hybrid pipeline: a 2-bit codebook over anchors, plus INT8/INT4 quantization for layers incompatible with QPP.
Combined parameter-count reduction with quantization, rather than lowering bit precision alone.
Achieved a 41.1% reduction in model size on Qwen3-4B, from 8,045 MB to 4,738 MB.
QPP was 7x more effective than GGUF Q4_K_M on attention layers (21x vs 3.2x).
Increased inference speed by 69% with cached weights.
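
The headline figures can be checked with a few lines of arithmetic. The sketch below only uses the numbers quoted on this page, and it assumes that the 21x and 3.2x figures are compression ratios; the variable names are illustrative.

```python
# Model sizes quoted in the key points and abstract (MB).
size_before_mb = 8045
size_after_mb = 4738

# Fractional reduction in model size.
reduction = (size_before_mb - size_after_mb) / size_before_mb
print(f"size reduction: {reduction:.1%}")  # size reduction: 41.1%

# Attention-layer figures quoted in the abstract,
# assumed here to be compression ratios.
qpp_ratio = 21.0
gguf_q4_k_m_ratio = 3.2
relative = qpp_ratio / gguf_q4_k_m_ratio
print(f"QPP vs GGUF Q4_K_M: {relative:.1f}x")  # QPP vs GGUF Q4_K_M: 6.6x
```

Under that assumption the attention-layer advantage works out to about 6.6x, which the page rounds to 7x.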

Abstract

We present QPP (Quantile Piecewise Perceptron), a novel parametric compression technique that reduces the number of parameters rather than just their bit precision. QPP represents each weight matrix row as a quantile curve fitted with 32 anchors and block-shared ordering. Combined with a 2-bit codebook over anchors and INT8/INT4 quantization for incompatible layers, our hybrid pipeline achieves 41.1% compression on Qwen3-4B (8,045 to 4,738 MB) with coherent generation. QPP is 7x more effective than GGUF Q4_K_M on attention layers (21x vs 3.2x) and 69% faster at inference with cached weights. We also built a physical QPP+GGUF combined file demonstrating orthogonal compression axes.
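
To make the quantile-curve idea concrete, here is a minimal sketch of one way a weight row could be reduced to 32 anchors and rebuilt. It is an illustration under stated assumptions, not the paper's implementation: the evenly spaced anchor placement, the linear interpolation, and the function names `sample_linear`, `fit_quantile_curve`, and `reconstruct_row` are all hypothetical, and the sketch does not model block-shared ordering, the 2-bit codebook, or INT8/INT4 quantization.

```python
import random


def sample_linear(values, t):
    """Linearly interpolate a list at fractional index t (needs len >= 2)."""
    lo = min(int(t), len(values) - 2)
    frac = t - lo
    return values[lo] * (1.0 - frac) + values[lo + 1] * frac


def fit_quantile_curve(row, num_anchors=32):
    """Return the sort order of a row and anchors sampled from its sorted values.

    The sorted row is an empirical quantile curve; the anchors are evenly
    spaced samples of it (an assumption of this sketch).
    """
    n = len(row)
    order = sorted(range(n), key=row.__getitem__)
    sorted_vals = [row[i] for i in order]
    step = (n - 1) / (num_anchors - 1)
    anchors = [sample_linear(sorted_vals, k * step) for k in range(num_anchors)]
    return order, anchors


def reconstruct_row(order, anchors):
    """Rebuild an approximate row from the ordering and the anchors."""
    n = len(order)
    step = (len(anchors) - 1) / (n - 1)
    row = [0.0] * n
    for rank, idx in enumerate(order):
        # The element with this rank gets the curve value at its quantile.
        row[idx] = sample_linear(anchors, rank * step)
    return row


# Toy row of 256 Gaussian weights (synthetic data, not from any model).
random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(256)]
order, anchors = fit_quantile_curve(row)
approx = reconstruct_row(order, anchors)
max_err = max(abs(a - b) for a, b in zip(row, approx))
print(len(anchors), "anchors, max abs error:", max_err)
```

In this toy version the 256 values of a row are replaced by 32 anchors plus an ordering. The ordering itself still has to be stored, and the sketch keeps one per row; the abstract's block-shared ordering is not modeled here.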
