What question did this study set out to answer?

To investigate a self-assessment method for language models that minimizes score inflation without external dependencies.

April 8, 2026 | Open Access

Calibrated Self-Assessment in Sub-Frontier Language Models: Eliminating Score Inflation Through Forced Gap Identification on Consumer Hardware Without External Model Dependency

Key Points

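Ran a 14-billion-parameter language model on consumer hardware, with no external model, cloud service, or human scorer in the evaluation loop.
Used a calibration forcing function: a prompt-level technique that requires the model to identify a specific rubric gap before it can assign the maximum score.
Conducted four experimental runs on the same canonical verification battery, scoring outputs against a structured rubric.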
Compared self-assessment scores with keyword-based scoring techniques.
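Without the forcing function, all outputs scored 10/10 across four probes in the initial run; with it, scores spread across four distinct values (6, 7, 8, 10).
Self-assessment handled context better than keyword scoring: it recognized that a negated forbidden term is not a violation, where keyword matching produced a false positive.
Responses that contained the required keywords but lacked the reasoning depth the rubric demands were correctly penalized.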
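
The two keyword-scoring failure modes named above can be illustrated with a minimal sketch. The page does not describe the study's keyword scorer, so `keyword_score`, the term lists, and the sample responses below are hypothetical and only assume plain substring matching.

```python
def keyword_score(response, required, forbidden):
    """Naive substring-based keyword scorer (hypothetical, not the study's code)."""
    text = response.lower()
    hits = [term for term in required if term.lower() in text]
    violations = [term for term in forbidden if term.lower() in text]
    return {
        "required_hits": hits,
        "violations": violations,
        # Pass only if every required term appears and no forbidden term does.
        "passed": len(hits) == len(required) and not violations,
    }


# Illustrative rubric terms (assumptions, not taken from the paper).
REQUIRED = ["cache", "validate"]
FORBIDDEN = ["guaranteed"]

# Failure mode 1: the forbidden term is negated, yet substring matching flags it.
negated = "The cache is not guaranteed to be consistent, so every read is validated first."
print(keyword_score(negated, REQUIRED, FORBIDDEN)["passed"])  # False (false positive)

# Failure mode 2: all required keywords are present, but there is no reasoning.
shallow = "Cache. Validate."
print(keyword_score(shallow, REQUIRED, FORBIDDEN)["passed"])  # True
```

A scorer of this kind has no access to negation or reasoning depth, which is the contrast the key points draw with rubric-based self-assessment.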

Abstract

We present evidence that a 14-billion parameter language model running on consumer hardware can reliably assess the quality of its own outputs against a structured rubric, without any external model, cloud service, or human scorer in the evaluation loop. The key methodological contribution is a calibration forcing function: a prompt-level technique that requires the model to identify a specific rubric gap before assigning the maximum score. Without this forcing function, the model exhibits uniform score inflation (all outputs scored 10/10 across four probes in the initial run). With the forcing function applied, scores distribute across four distinct values (6, 7, 8, 10) with rubric-grounded gap descriptions that are defensible by human evaluation on all nine valid assessments. Across four experimental runs on the same canonical verification battery, we observed three findings with implications for sovereign AI deployment. First, the self-assessor demonstrates superior contextual understanding over mechanical keyword scoring: it correctly identifies that a negated forbidden term does not constitute a violation, while keyword matching produces a false positive. Second, the self-assessor correctly penalises responses that contain required keywords but lack the reasoning depth the rubric demands. Third, the directional pattern of disagreement between self-assessment and keyword scoring is consistent and interpretable: self-assessment scores higher when keywords are wrong, and lower when responses lack reasoning substance that keywords cannot detect. These findings suggest that the dependency on frontier models or human scorers for quality assurance in locally deployed AI systems is not absolute.
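
As a rough sketch of how a calibration forcing function might be wired up, the example below builds a scoring prompt that asks for a rubric gap before the score and then checks the reply. The abstract does not give the actual prompt or output format, so the wording, the JSON reply shape, the `NONE` convention, and the names `build_assessment_prompt` and `validate_assessment` are all assumptions made for illustration.

```python
import json

MAX_SCORE = 10  # the abstract reports scores out of 10


def build_assessment_prompt(rubric: str, response: str) -> str:
    """Build a self-assessment prompt that asks for a gap before a score (hypothetical wording)."""
    return (
        "You are scoring a response against a rubric.\n"
        f"Rubric:\n{rubric}\n\n"
        f"Response:\n{response}\n\n"
        "Step 1: name the most significant way the response falls short of the rubric. "
        "Write NONE only if you checked every criterion and found no gap.\n"
        f"Step 2: assign an integer score from 1 to {MAX_SCORE}. "
        f"A score of {MAX_SCORE} is allowed only if Step 1 is NONE.\n"
        'Reply as JSON: {"gap": "...", "score": <int>}'
    )


def validate_assessment(raw: str) -> dict:
    """Reject replies that skip the gap step or pair a stated gap with the maximum score."""
    data = json.loads(raw)
    gap = str(data.get("gap", "")).strip()
    score = data.get("score")
    if not gap:
        raise ValueError("assessment must state a rubric gap (or NONE) before scoring")
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= MAX_SCORE:
        raise ValueError(f"score must be an integer from 1 to {MAX_SCORE}")
    if score == MAX_SCORE and gap.upper() != "NONE":
        raise ValueError("maximum score is inconsistent with a stated rubric gap")
    return {"gap": gap, "score": score}
```

Under these assumptions, a reply such as `{"gap": "No justification given for the second claim", "score": 7}` is accepted, while a reply that states a gap and still scores 10 is rejected and could be re-prompted.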

Authors

Farah Jaber

Hôpital du Sacré-Cœur de Montréal
