Description

Beyond the Average Research Series – Working Paper

This working paper examines explanation behaviour in AI judgement systems under repeated evaluation. It builds on the Behavioural Evaluation Framework (Hull, 2026), extending earlier work on judgement stability, non-resolution and confidence behaviour by examining whether explanations reflect stable reasoning across repeated evaluation.

The analysis draws on the Phase 4 behavioural evaluation study within the Agents at Work research series (Hull, 2025–2026), which examines how large language models interpret age-coded language in recruitment text and how those judgements behave when the same evaluative task is repeated.

The paper focuses on how explanations behave when classification outcomes remain stable and when they vary under identical conditions. While explanations are often interpreted as evidence that reasoning is reliable and well-founded, the analysis shows that explanations may remain highly plausible and broadly consistent even where underlying judgements change across repeated evaluation. This pattern is examined as a separation between plausibility and behavioural stability. Rather than reflecting a fixed reasoning process, explanations provide a credible account of a decision in a single instance.

Together with earlier findings on judgement variation, non-resolution and confidence behaviour, this work extends the behavioural evaluation framework beyond output classification to examine how reasoning behaves under repeated observation.

Version note – 1.0

This version presents the initial working paper release examining explanation drift and reasoning stability under repeated evaluation.

Abstract

Explanations are widely used to interpret and validate AI decisions. In many evaluation contexts, a coherent explanation is taken as evidence that a model’s output is reliable and well-grounded. This paper examines how explanations behave in repeated evaluations of recruitment text.
Building on the Behavioural Evaluation Framework, the analysis examines whether explanations reflect stable reasoning under repeated evaluation. Using repeated evaluations of 150 job advertisements for potential age-related bias, the findings show that explanations remain highly similar across runs, with a mean similarity of approximately 0.86. This indicates that explanations are generally consistent in overall meaning.

However, variation is observed in how explanations are expressed. In cases where language is ambiguous, the system may shift the emphasis of its reasoning across runs, highlighting different aspects of the input as supporting evidence. In 18.7% of cases, classification outcomes vary under identical conditions. In these cases, explanations often remain similar despite differences in the final decision.

These results indicate that an explanation reflects a plausible account of a decision in a single instance rather than a stable reasoning process. Explanation therefore does not provide a reliable indication of whether reasoning will remain stable under repeated evaluation. Reliability in AI judgement systems must be assessed through observed behavioural patterns rather than explanation alone.

Note

This paper is released as a working paper to present findings on explanation behaviour within the Behavioural Evaluation Framework. It extends earlier work on judgement stability, non-resolution and confidence behaviour by examining explanation as an internal representation of reasoning under repeated evaluation. Future work will examine how explanation drift interacts with behavioural perturbation, cross-model comparison and operational decision contexts as part of the ongoing Agents at Work research series.
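The two quantities reported in the abstract, mean explanation similarity across runs and the share of advertisements whose classification varies, could be computed along the following lines. This is a minimal sketch and not the paper's method. The paper does not state which similarity measure underlies the 0.86 figure, so bag-of-words cosine similarity is used here purely as a stand-in. The function names, the data layout (advert id mapped to a list of label and explanation pairs) and the labels are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors (assumed stand-in metric)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def mean_pairwise_similarity(explanations: list[str]) -> float:
    """Average similarity over every pair of explanations for one advert."""
    pairs = list(combinations(explanations, 2))
    if not pairs:
        return 1.0  # a single run is trivially consistent with itself
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


def summarise(runs: dict[str, list[tuple[str, str]]]) -> tuple[float, float]:
    """Return (mean explanation similarity, share of adverts with varying labels).

    `runs` maps a hypothetical advert id to one (label, explanation) pair per run.
    """
    sims = [mean_pairwise_similarity([expl for _, expl in r]) for r in runs.values()]
    unstable = sum(1 for r in runs.values() if len({label for label, _ in r}) > 1)
    return sum(sims) / len(runs), unstable / len(runs)


# Toy data, invented for illustration only (not drawn from the study).
toy_runs = {
    "ad1": [("flagged", "the phrase digital native signals age"),
            ("flagged", "the phrase digital native signals age")],
    "ad2": [("flagged", "energetic team may signal age"),
            ("not_flagged", "energetic team may signal age")],
}
mean_similarity, variation_rate = summarise(toy_runs)
```

In the toy data, the second advert receives identical explanations but different labels, which mirrors the pattern the abstract describes: explanation similarity can stay high while the classification itself varies.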
Imogen Hull studied this question.