Description

Beyond the Average Research Series – Working Paper

This working paper examines non-resolution in AI judgement systems under repeated evaluation. It builds on the Behavioural Evaluation Framework (Hull, 2026), extending the analysis beyond variation to examine cases where judgements remain stable but do not resolve into a definitive classification. The analysis draws on the Phase 4 behavioural evaluation study within the Agents at Work research series (Hull, 2025–2026), which examines how large language models interpret age-coded language in recruitment text and how those judgements behave when the same evaluative task is repeated.

The paper focuses on a subset of inputs that consistently produce indeterminate outcomes. Rather than varying across runs, these cases remain fixed in an “Unclear” classification, with stable confidence and no evidence of threshold crossing. This pattern is examined as a distinct behavioural property, termed non-resolution, in which the system detects relevant signals but does not assign sufficient weight to produce a definitive outcome. Together with earlier findings on variation in judgement, this work shows that boundary cases do not produce a single type of behaviour: some vary across runs, while others remain stable but indeterminate.

Version note – 1.1

Version 1.1 refines the framing and explanatory structure of the paper to improve clarity around non-resolution as a distinct behavioural pattern. The empirical basis, method and findings remain unchanged.

Abstract

Most approaches to evaluating AI systems focus on the quality of individual outputs or, more recently, on how those outputs vary when the same task is repeated. Variation across repeated runs is typically interpreted as instability in judgement. This paper examines a different behavioural pattern: non-resolution. In some evaluative tasks, AI systems do not move between classifications under repeated execution, but instead remain fixed in an indeterminate state.
Using repeated evaluations of UK job advertisements for potential age-related bias, the analysis shows that certain early-career roles consistently produce “Unclear” classifications across multiple runs, with stable confidence and no evidence of threshold crossing or convergence toward more decisive outcomes. This behaviour differs from instability. The system does not vary in its judgement, but nor does it resolve the judgement into a definitive classification. The findings suggest that non-resolution represents a distinct behavioural property of AI judgement systems, particularly in cases involving weak, mixed, or ambiguous signals. The paper argues that stability alone is insufficient for evaluating AI reliability. A judgement may remain consistent while still failing to resolve into a usable decision. Behavioural evaluation therefore depends not only on whether outputs remain stable, but on whether they resolve at all.

Note

This paper is released as a working paper to present findings on non-resolution within the Behavioural Evaluation Framework. It extends earlier work on judgement stability by identifying stable uncertainty as a distinct behavioural property of AI systems. Future work will examine how non-resolution interacts with other behavioural signals, including confidence behaviour, explanation stability, and sensitivity to input variation, as part of the ongoing Agents at Work research series.
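
The distinction the abstract draws between variation and non-resolution can be illustrated with a minimal sketch. This is not the paper's method. The function name `classify_repeated_runs`, the `confidence_tolerance` parameter, its default of 0.05, and every label other than “Unclear” are assumptions introduced purely for illustration. Each run is assumed to yield a (label, confidence) pair for the same input.

```python
def classify_repeated_runs(runs, unclear_label="Unclear", confidence_tolerance=0.05):
    """Describe how repeated judgements of one input behave.

    runs: list of (label, confidence) pairs from repeating the same task.
    confidence_tolerance: hypothetical cut-off for calling confidence "stable".
    """
    if not runs:
        raise ValueError("at least one run is required")

    labels = {label for label, _ in runs}
    confidences = [confidence for _, confidence in runs]

    # More than one distinct label across runs: the judgement varies.
    if len(labels) > 1:
        return "variation"

    (label,) = labels
    # A single definitive label on every run: the judgement resolves.
    if label != unclear_label:
        return "resolved"

    # Every run is "Unclear": check whether confidence also stays put.
    spread = max(confidences) - min(confidences)
    if spread <= confidence_tolerance:
        return "non-resolution"
    return "unclear, confidence drifting"


# Illustrative inputs only; these are not data from the study.
print(classify_repeated_runs([("Unclear", 0.55), ("Unclear", 0.56), ("Unclear", 0.55)]))
print(classify_repeated_runs([("Biased", 0.80), ("Unclear", 0.50)]))
```

Under these assumptions, a stability check based on label agreement alone would treat the first input as reliable, which is why the sketch reports resolution separately from stability.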
Imogen Hull (Fri,) studied this question.