What question did this study set out to answer?

This work aims to compare the calibration of confidence in two LLM deployment approaches: PTQ and QLoRA.

May 12, 2026 · Open Access

Confidence Without Competence: Calibration Risks in Real-World LLM Deployment Pipelines

Key Points

Compares confidence calibration across two LLM deployment approaches: PTQ and QLoRA.
Evaluates both approaches empirically, using Gemma-2-2B as the base model.
Uses the MMLU, ARC-Challenge, and TruthfulQA datasets for assessment.
Measures calibration with Expected Calibration Error (ECE), Brier Score, Negative Log-Likelihood (NLL), and overconfidence rate.
PTQ had a 29.7% overconfidence rate and an ECE of 0.293; QLoRA had a 0.9% overconfidence rate and an ECE of 0.038.
PTQ's ECE was 7.7× that of QLoRA (0.293 vs. 0.038), even though PTQ achieved 23% higher task accuracy.
Differences in instruction-tuning scale between the two systems partially confound the comparison.

Abstract

Large language model (LLM) deployment increasingly depends on model compression techniques to satisfy resource constraints. Two dominant approaches are Post-Training Quantization (PTQ), which quantizes a pre-trained instruction-tuned model, and Quantization-Aware Fine-Tuning (QLoRA), which performs fine-tuning directly within the quantized weight space. While prior work primarily evaluates these approaches using downstream task accuracy, their impact on confidence calibration remains insufficiently explored. Calibration measures the alignment between a model’s predicted confidence and its actual correctness, making it a critical safety property for production systems. A model that is confidently incorrect can pose greater operational risk than one that expresses appropriate uncertainty. In this work, we empirically compare the calibration behavior of PTQ and QLoRA deployments using Gemma-2-2B as the base model. Evaluation is conducted on MMLU, ARC-Challenge, and TruthfulQA using Expected Calibration Error (ECE), Brier Score, Negative Log-Likelihood (NLL), and behavioral metrics such as overconfidence rate. We find that the evaluated PTQ configuration achieves higher task accuracy but substantially worse calibration, exhibiting an overconfidence rate of 29.7% and ECE of 0.293, compared to 0.9% and 0.038 respectively for QLoRA. This represents a 7.7× gap in ECE, despite PTQ achieving 23% higher task accuracy. We further note that this comparison is partially confounded by differences in instruction-tuning scale between the evaluated systems. Nevertheless, our results suggest that accuracy alone is an insufficient criterion for deployment decisions when calibration and reliability are safety-critical concerns.
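
For readers unfamiliar with the metrics named in the abstract, the following is a minimal sketch of how they are commonly computed for multiple-choice predictions. It is not the study's evaluation code. The function name `calibration_metrics`, the use of 10 equal-width bins, and the overconfidence definition (an incorrect prediction with confidence of at least 0.9) are all assumptions made for illustration; the abstract does not state the binning scheme or the threshold it uses.

```python
import math


def calibration_metrics(probs, labels, n_bins=10, overconf_threshold=0.9):
    """Illustrative calibration metrics (hypothetical helper, not the paper's code).

    probs:  list of per-example probability lists, one probability per answer choice
    labels: list of integer indices of the correct choice
    """
    n = len(labels)
    confidences = [max(p) for p in probs]            # confidence = top probability
    predictions = [p.index(max(p)) for p in probs]   # predicted choice
    correct = [int(pred == y) for pred, y in zip(predictions, labels)]

    # ECE: group predictions into equal-width confidence bins, then take the
    # size-weighted average of |bin accuracy - bin mean confidence|.
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)    # confidence 1.0 goes in the last bin
        bins[idx].append((conf, hit))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(h for _, h in b) / len(b)
            ece += (len(b) / n) * abs(acc - avg_conf)

    # Multiclass Brier score: mean squared distance between the probability
    # vector and the one-hot encoding of the correct choice.
    brier = sum(
        sum((p_k - (1.0 if k == y else 0.0)) ** 2 for k, p_k in enumerate(p))
        for p, y in zip(probs, labels)
    ) / n

    # NLL: mean negative log-probability assigned to the correct choice.
    eps = 1e-12  # guards against log(0)
    nll = -sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels)) / n

    # Overconfidence rate (ASSUMED definition): share of all predictions that
    # are wrong while confidence is at or above the threshold.
    overconf = sum(
        1 for c, h in zip(confidences, correct)
        if c >= overconf_threshold and not h
    ) / n

    return {"ece": ece, "brier": brier, "nll": nll, "overconfidence_rate": overconf}
```

Under these assumptions, a call such as `calibration_metrics([[0.95, 0.05], [0.6, 0.4]], [1, 0])` returns a dictionary with the four metrics. A model can score well on accuracy and still show a high ECE or overconfidence rate if its wrong answers cluster at high confidence, which is the pattern the abstract reports for the PTQ configuration.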

Authors

Sagar Sharma

University Hospitals of Leicester NHS Trust
