Description

Beyond the Average Research Series – Working Paper

This working paper examines confidence behaviour in AI judgement systems under repeated evaluation. It builds on the Behavioural Evaluation Framework (Hull, 2026), extending earlier work on judgement stability and non-resolution by examining whether confidence reflects underlying behavioural reliability. The analysis draws on the Phase 4 behavioural evaluation study within the Agents at Work research series (Hull, 2025–2026), which examines how large language models interpret age-coded language in recruitment text and how those judgements behave when the same evaluative task is repeated.

The paper focuses on how confidence values behave when classification outcomes remain stable and when they vary under identical conditions. While confidence is commonly interpreted as an indicator of reliability, the analysis shows that confidence values remain comparatively stable even where underlying judgements change across repeated evaluation. This pattern is examined as a distinct separation between expressed certainty and behavioural stability. Rather than indicating whether a judgement remains stable across repeated runs, confidence reflects how strongly a decision is expressed in a single instance.

Together with earlier findings on judgement variation and non-resolution, this work extends the behavioural evaluation framework beyond output classification to examine how internal signals behave under repeated observation.

Version note – 1.0

This version presents the initial working paper release examining confidence behaviour as an internal signal within AI judgement systems under repeated evaluation.

Abstract

Confidence scores are widely used as indicators of reliability in AI judgement systems. Higher confidence is often treated as evidence that a decision is dependable. This paper examines how confidence behaves in repeated evaluations of recruitment text.
Building on the Behavioural Evaluation Framework, the analysis examines whether confidence reflects underlying judgement stability under repeated evaluation. Using repeated evaluations of 150 job advertisements for potential age-related bias, the findings show that confidence remains highly stable across runs, typically within a narrow range centred around 0.60–0.62. This stability persists even in cases where classification outcomes vary across repeated evaluation. In 18.7% of cases, judgements change under identical conditions, most often between adjacent categories such as “Potentially Biased” and “Unclear”. However, confidence does not adjust in response to this variation and may in some cases be higher in unstable cases than in stable ones.

These results indicate that confidence reflects the strength of expression of a decision rather than its behavioural stability. Confidence therefore does not provide a reliable indication of whether a judgement will remain stable under repeated evaluation. Reliability must be assessed through observed behavioural patterns rather than confidence alone.

Note

This paper is released as a working paper to present findings on confidence behaviour within the Behavioural Evaluation Framework. It extends earlier work on judgement stability and non-resolution by examining confidence as an internal signal under repeated evaluation. Future work will examine how confidence behaviour interacts with explanation stability, cross-model comparison, and sensitivity to input variation as part of the ongoing Agents at Work research series.
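The separation the abstract draws between expressed confidence and behavioural stability can be illustrated with a minimal sketch. It is not the study's method or data: the record layout, the function names `summarise` and `compare`, and the three example advertisements are all hypothetical. Only the category labels come from the abstract. The sketch marks an item as stable when every repeated run returns the same label, then compares mean confidence between stable and unstable items.

```python
from collections import Counter
from statistics import mean

# Hypothetical illustrative data (NOT from the study): for each advertisement,
# one (label, confidence) pair per repeated run under identical conditions.
runs = {
    "ad_001": [("Potentially Biased", 0.61), ("Potentially Biased", 0.60), ("Potentially Biased", 0.62)],
    "ad_002": [("Potentially Biased", 0.62), ("Unclear", 0.61), ("Potentially Biased", 0.62)],
    "ad_003": [("Unclear", 0.60), ("Unclear", 0.60), ("Unclear", 0.61)],
}


def summarise(item_runs):
    """Summarise one item's repeated runs (hypothetical helper)."""
    labels = [label for label, _ in item_runs]
    confidences = [conf for _, conf in item_runs]
    # Share of runs that agree with the most frequent label.
    modal_count = Counter(labels).most_common(1)[0][1]
    return {
        "stable": len(set(labels)) == 1,  # same label on every run
        "agreement": modal_count / len(labels),
        "mean_confidence": mean(confidences),
    }


def compare(all_runs):
    """Compare mean confidence for stable versus unstable items."""
    summaries = [summarise(item_runs) for item_runs in all_runs.values()]
    stable = [s["mean_confidence"] for s in summaries if s["stable"]]
    unstable = [s["mean_confidence"] for s in summaries if not s["stable"]]
    return {
        "unstable_share": len(unstable) / len(summaries),
        "mean_conf_stable": mean(stable) if stable else None,
        "mean_conf_unstable": mean(unstable) if unstable else None,
    }


result = compare(runs)
print(result)
```

If confidence tracked behavioural stability, `mean_conf_unstable` would be expected to fall clearly below `mean_conf_stable`. In this toy data the two are close, and the unstable item is slightly higher, which is the kind of pattern the abstract reports.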
Imogen Hull (Fri,) studied this question.