Quantization lowers the numerical precision of neural network parameters to accelerate inference and reduce power consumption, but it often degrades accuracy noticeably. We propose a differentiable quantization framework that replaces the non-differentiable rounding operation with a continuous surrogate function. During quantization-aware training (QAT), gradients are backpropagated through this surrogate rather than estimated with the straight-through estimator (STE), which enables gradient-based optimization of model weights, quantization parameters, and layer-wise bit-width configurations. Experiments on CIFAR-10 show that our method achieves higher accuracy than several representative quantization approximation methods across different bit-width settings. On embedded platforms, it improves post-quantization accuracy on detection and segmentation tasks by up to 3.66 percentage points over industrial quantization frameworks such as TensorRT and Huawei AMCT, and it outperforms representative bit-width allocation methods by up to 7.49 percentage points. These results demonstrate the effectiveness of the proposed method for improving the accuracy of quantized neural networks on resource-constrained devices.
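To make the idea of a continuous rounding surrogate concrete, the following minimal sketch uses a tanh-based soft-rounding function. This is an illustrative stand-in chosen for this example, not the surrogate proposed in this work; the names `soft_round`, `soft_round_grad`, `fake_quantize`, and the sharpness parameter `alpha` are all assumptions. The point it shows is that such a function matches hard rounding at integers, approaches it as `alpha` grows, and has a nonzero analytic gradient everywhere, so no straight-through estimate is needed.

```python
import math


def soft_round(x, alpha=5.0):
    """Smooth stand-in for round(x); approaches hard rounding as alpha grows.

    Illustrative tanh-based surrogate (an assumption, not the paper's function).
    """
    base = math.floor(x)
    r = x - base - 0.5  # offset from the midpoint of the unit cell, in [-0.5, 0.5)
    return base + 0.5 + math.tanh(alpha * r) / (2.0 * math.tanh(alpha / 2.0))


def soft_round_grad(x, alpha=5.0):
    """Analytic derivative of soft_round with respect to x (always positive)."""
    r = x - math.floor(x) - 0.5
    sech2 = 1.0 - math.tanh(alpha * r) ** 2
    return alpha * sech2 / (2.0 * math.tanh(alpha / 2.0))


def fake_quantize(w, scale, bits=4, alpha=5.0):
    """Quantize-dequantize one weight on a signed integer grid.

    The soft rounding keeps the mapping differentiable in w and scale;
    the clamp is still piecewise linear and only has a subgradient at its edges.
    """
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = soft_round(w / scale, alpha)
    q = min(max(q, qmin), qmax)
    return q * scale


if __name__ == "__main__":
    for a in (2.0, 5.0, 50.0):
        print(a, soft_round(2.3, a), soft_round_grad(2.3, a))
    print(fake_quantize(0.23, scale=0.1, bits=4))
```

With a small `alpha` the surrogate is close to the identity and gradients are well spread; with a large `alpha` it is close to true rounding but the gradient concentrates near the half-integer boundaries. A training schedule that increases `alpha` over time is one common way to trade these off, though whether this work does so is not stated here.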
Yang et al. (Mon,) studied this question.