What question did this study set out to answer?

The aim is to develop a calibration system that addresses EU AI Act requirements while ensuring confidence in AI outputs.

April 6, 2026 · Open Access

Calibrated Confidence via Self-Consistency Voting for EU AI Act Compliance

Key Points

Developed a calibration system based on self-consistency voting without the need for additional calibration sets.
Sampled multiple responses from a language model and computed vote entropy to classify outputs.
Evaluated performance across four domains related to the EU AI Act: logical reasoning, medicine, law, and accounting.
HIGH-confidence outputs achieved accuracy rates of 91–99%.
LOW-confidence outputs had accuracy rates below 43%.
Cross-domain expected calibration error was 5.13% over 400 test items, with a Cohen’s d of 1.19 (a large effect size) separating the vote-entropy distributions of correct and incorrect answers.
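
The sampling-and-voting step in the points above can be sketched as follows. This is a minimal illustration, not the study's implementation: the function names, the use of base-2 entropy, and the two entropy cut-offs (`high_max`, `medium_max`) are assumptions made for the example, since the summary does not state the thresholds or the number of samples used.

```python
import math
from collections import Counter


def vote_entropy(answers):
    """Shannon entropy (in bits) of the empirical distribution of sampled answers."""
    total = len(answers)
    probs = [count / total for count in Counter(answers).values()]
    return sum(p * math.log2(1 / p) for p in probs)


def classify(answers, high_max=0.5, medium_max=1.0):
    """Return (majority answer, entropy, confidence level).

    high_max and medium_max are hypothetical cut-offs chosen for illustration;
    the study's actual thresholds are not given in this summary.
    """
    entropy = vote_entropy(answers)
    majority = Counter(answers).most_common(1)[0][0]
    if entropy <= high_max:
        level = "HIGH"
    elif entropy <= medium_max:
        level = "MEDIUM"
    else:
        level = "LOW"
    return majority, entropy, level


# Five sampled responses to the same question, reduced to their final answers.
print(classify(["B", "B", "B", "B", "B"]))  # unanimous vote, zero entropy
print(classify(["A", "B", "C", "D", "A"]))  # scattered vote, high entropy
```

A unanimous vote always lands in the HIGH level regardless of whether the answer is right, which is the source of the false confidence floor the abstract describes.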

Abstract

The EU AI Act mandates that high-risk AI systems declare accuracy metrics, enable human oversight, and prevent automation bias. We present a practical calibration system based on self-consistency voting that meets these requirements without post-hoc calibration sets. By sampling multiple responses from a language model and computing vote entropy, we classify outputs into HIGH, MEDIUM, and LOW confidence levels. Across four domains relevant to EU AI Act Annex III (logical reasoning with LogiQA, professional medicine, professional law, and professional accounting), we demonstrate that HIGH-confidence outputs achieve 91–99% accuracy, while LOW-confidence outputs fall below 43%. Cross-domain expected calibration error is 5.13% over 400 test items, with a Cohen’s d of 1.19 separating correct from incorrect entropy distributions. We validate cross-model generalization on two architecturally distinct LLMs (DeepSeek V3 and Kimi K2 Turbo), confirming that the monotonic confidence-accuracy relationship holds across models, but also documenting a failure mode where evolutionary threshold optimization selects destructive hyperparameters, degrading accuracy by 28 percentage points. We identify an irreducible false confidence floor of 1–14% from unanimous wrong votes that self-consistency cannot detect, and map our calibration metrics directly to Articles 9, 13, 14, and 15 of the EU AI Act. Our system requires no labeled calibration data at inference time, works with any black-box LLM, and produces compliance-ready confidence reports.
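
The two summary statistics quoted in the abstract can be computed as in the sketch below. It uses the standard textbook definitions (binned expected calibration error and Cohen's d with a pooled standard deviation); whether the paper uses exactly these variants, how it converts votes into a numeric confidence score, and the bin count are all assumptions here. The function names and the toy data are hypothetical.

```python
import math
import statistics


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over confidence bins.

    confidences are scores in [0, 1] (assumed here to be something like the
    majority vote share); correct is a parallel list of booleans.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def cohens_d(group_a, group_b):
    """Cohen's d between two samples, using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Toy data only: vote entropies of incorrectly vs. correctly answered items.
entropy_incorrect = [1.2, 1.6, 0.9, 1.4]
entropy_correct = [0.0, 0.3, 0.5, 0.2]
print(cohens_d(entropy_incorrect, entropy_correct))
print(expected_calibration_error([0.9, 0.9, 0.6, 0.6], [True, True, True, False]))
```

In this framing, a larger d means vote entropy separates wrong answers from right ones more cleanly, and a lower ECE means the stated confidence tracks observed accuracy more closely.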

Cite This Study

Gefson Costa (Sat,) studied this question.

synapsesocial.com/papers/69d34e3e9c07852e0af97bec https://doi.org/10.5281/zenodo.19420170