The EU AI Act mandates that high-risk AI systems declare accuracy metrics, enable human oversight, and prevent automation bias. We present a practical calibration system based on self-consistency voting that meets these requirements without post-hoc calibration sets. By sampling multiple responses from a language model and computing vote entropy, we classify outputs into HIGH, MEDIUM, and LOW confidence levels. Across four domains relevant to EU AI Act Annex III, namely logical reasoning (LogiQA), professional medicine, professional law, and professional accounting, we demonstrate that HIGH-confidence outputs achieve 91–99% accuracy, while LOW-confidence outputs fall below 43%. Cross-domain expected calibration error is 5.13% over 400 test items, with a Cohen’s d of 1.19 separating correct from incorrect entropy distributions. We validate cross-model generalization on two architecturally distinct LLMs (DeepSeek V3 and Kimi K2 Turbo), confirming that the monotonic confidence-accuracy relationship holds across models, but also documenting a failure mode where evolutionary threshold optimization selects destructive hyperparameters, degrading accuracy by 28 percentage points. We identify an irreducible false confidence floor of 1–14% from unanimous wrong votes that self-consistency cannot detect, and map our calibration metrics directly to Articles 9, 13, 14, and 15 of the EU AI Act. Our system requires no labeled calibration data at inference time, works with any black-box LLM, and produces compliance-ready confidence reports.
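The voting step described in the abstract can be illustrated with a minimal sketch. It assumes the sampled responses have already been reduced to discrete answer labels, normalizes Shannon entropy by its maximum for the sample size, and uses placeholder thresholds (`HIGH_MAX`, `MEDIUM_MAX`); the function names and cut-off values are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

# Hypothetical cut-offs on normalized vote entropy; the abstract does not
# state the thresholds the authors actually use.
HIGH_MAX = 0.2
MEDIUM_MAX = 0.6


def vote_entropy(answers):
    """Shannon entropy of the answer votes, normalized to [0, 1]."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    if len(counts) == 1:
        return 0.0  # unanimous vote: no disagreement at all
    n = len(answers)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # With n samples the entropy is largest when every answer is distinct.
    return entropy / math.log2(n)


def classify(answers):
    """Return (majority answer, confidence level) for one test item."""
    majority = Counter(answers).most_common(1)[0][0]
    h = vote_entropy(answers)
    if h <= HIGH_MAX:
        level = "HIGH"
    elif h <= MEDIUM_MAX:
        level = "MEDIUM"
    else:
        level = "LOW"
    return majority, level


if __name__ == "__main__":
    print(classify(["B", "B", "B", "B", "B"]))  # unanimous vote
    print(classify(["B", "B", "B", "B", "C"]))  # one dissenting sample
    print(classify(["A", "B", "C", "D", "E"]))  # no agreement
```

A unanimous vote always has zero entropy and lands in the HIGH tier whether or not the shared answer is correct, which is why unanimous wrong votes cannot be detected by this signal alone.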
Gefson Costa
Gefson Costa (Sat,) studied this question.
www.synapsesocial.com/papers/69d34e3e9c07852e0af97bec — DOI: https://doi.org/10.5281/zenodo.19420170