Large language model (LLM) deployment increasingly depends on model compression techniques to satisfy resource constraints. Two dominant approaches are Post-Training Quantization (PTQ), which quantizes a pre-trained instruction-tuned model, and Quantization-Aware Fine-Tuning (QLoRA), which performs fine-tuning directly within the quantized weight space. While prior work primarily evaluates these approaches using downstream task accuracy, their impact on confidence calibration remains insufficiently explored. Calibration measures the alignment between a model’s predicted confidence and its actual correctness, making it a critical safety property for production systems. A model that is confidently incorrect can pose greater operational risk than one that expresses appropriate uncertainty. In this work, we empirically compare the calibration behavior of PTQ and QLoRA deployments using Gemma-2-2B as the base model. Evaluation is conducted on MMLU, ARC-Challenge, and TruthfulQA using Expected Calibration Error (ECE), Brier Score, Negative Log-Likelihood (NLL), and behavioral metrics such as overconfidence rate. We find that the evaluated PTQ configuration achieves higher task accuracy but substantially worse calibration: its overconfidence rate is 29.7% and its ECE is 0.293, compared with 0.9% and 0.038, respectively, for QLoRA. This is a 7.7× gap in ECE, even though PTQ achieves 23% higher task accuracy. We further note that this comparison is partially confounded by differences in instruction-tuning scale between the evaluated systems. Nevertheless, our results suggest that accuracy alone is an insufficient criterion for deployment decisions when calibration and reliability are safety-critical concerns.
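To make the calibration metrics named above concrete, the following is a minimal Python sketch, not the evaluation code used in this work. It assumes each prediction is reduced to a top-answer confidence in [0, 1] and a 0/1 correctness label. The equal-width 10-bin ECE, the top-label form of the Brier score and NLL, and the 0.9 threshold in `overconfidence_rate` are illustrative assumptions; the abstract does not specify these choices, and all function names are hypothetical.

```python
import math


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE: sample-weighted mean of |accuracy - confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


def brier_score(confidences, correct):
    """Top-label Brier score: mean squared gap between confidence and correctness."""
    pairs = list(zip(confidences, correct))
    return sum((c - y) ** 2 for c, y in pairs) / len(pairs)


def negative_log_likelihood(confidences, correct, eps=1e-12):
    """Binary NLL of the correctness labels under the stated confidences."""
    total = 0.0
    for conf, is_correct in zip(confidences, correct):
        p = min(max(conf, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= math.log(p) if is_correct else math.log(1.0 - p)
    return total / len(confidences)


def overconfidence_rate(confidences, correct, threshold=0.9):
    """Assumed definition: share of all predictions that are wrong
    while carrying confidence >= threshold."""
    wrong_and_confident = sum(
        1 for c, y in zip(confidences, correct) if c >= threshold and not y
    )
    return wrong_and_confident / len(confidences)


# Toy data: two high-confidence and two low-confidence answers, half of each correct.
toy_conf = [0.95, 0.95, 0.55, 0.55]
toy_correct = [1, 0, 1, 0]
toy_ece = expected_calibration_error(toy_conf, toy_correct)
toy_brier = brier_score(toy_conf, toy_correct)
toy_nll = negative_log_likelihood(toy_conf, toy_correct)
toy_overconf = overconfidence_rate(toy_conf, toy_correct)
```

On the toy data, the high-confidence bin dominates the ECE because its accuracy (0.5) falls far below its mean confidence (0.95). This is the same pattern the abstract describes for the PTQ configuration: confident answers that are often wrong.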
Sagar Sharma
University Hospitals of Leicester NHS Trust
Sagar Sharma (Sun,) studied this question.
synapsesocial.com/papers/6a02c364ce8c8c81e9640c0a — DOI: https://doi.org/10.5281/zenodo.20111217