
July 3, 2026 · Open Access

The Skeleton and the Tissue: Normative Drift, Output Stability, and the Limits of Self-Audit in Claude

Key Points

Examines whether an AI language model can maintain consistent decisions even as its underlying reasoning changes over time.
Conducts a systematic behavioral evaluation of Claude across five decision domains and four languages.
Uses a structured 17-prompt adversarial trajectory followed by recursive self-audit protocols (RAI-1 and RAI-2).
Analyzes 52 experimental runs, approximately 884 interactions in total, across control and treatment conditions.
Observes high output stability despite instability in reasoning processes.
Finds that behavioral drift occurs predominantly in reasoning trajectories rather than in final decisions.
Reports consistent behavioral differences across languages and describes a new behavioral pattern termed fortification.
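The interaction count is consistent with each run contributing one response per trajectory prompt. This is an assumption: the page says only "approximately" and does not state whether the RAI-1 and RAI-2 self-audit turns are included in the total.

```latex
52 \text{ runs} \times 17 \text{ prompts per run} = 884 \text{ interactions}
```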

Abstract

This repository contains the preprint "Stable Output, Shifting Process: A Trajectory-Level Evaluation of Claude under Institutional Pressure", which presents a systematic behavioral evaluation of Claude across five high-stakes decision domains, four languages, and controlled adversarial conversational trajectories. The study investigates a fundamental question in AI alignment: can a language model preserve the same final decision while the reasoning process and normative criterion that support that decision change over time?

The evaluation includes:
52 complete experimental runs (approximately 884 documented interactions).
Five high-impact domains: criminal pretrial detention, ICU triage, institutional governance, public resource allocation, and critical infrastructure.
Four independently generated language corpora: Spanish, English, German, and Simplified Chinese.
Two experimental conditions: a control condition and a treatment condition with prior bias elicitation.

The experiments employ a structured 17-prompt adversarial trajectory followed by recursive self-audit protocols (RAI-1 and RAI-2) to distinguish three behavioral properties that are often conflated in alignment evaluations: output stability, criterion stability, and process stability.

The principal findings are:
High output stability despite systematic instability in reasoning processes.
Evidence that behavioral drift frequently occurs in the reasoning trajectory rather than in the final decision itself.
Structural circularity in recursive self-audits: higher-order audits repeatedly identify the same mechanisms they are meant to evaluate.
Consistent behavioral differences across languages that cannot be explained by translation effects alone.
A previously undescribed behavioral pattern, termed fortification, in which an initially flexible criterion gradually becomes a defended outcome while the observable outputs remain identical.
Beyond reporting empirical observations, the paper proposes:
A four-level taxonomy of behavioral drift.
Explicit operational definitions separating output, criterion, and process stability.
Falsifiable hypotheses with corresponding falsification conditions.
An experimental framework intended to support independent replication and future alignment research.

The work adopts an exploratory behavioral perspective. It does not attempt to infer internal model mechanisms or intentionality; instead, it focuses exclusively on reproducible, observable behavior under controlled conversational trajectories. The preprint is part of an ongoing research program on trajectory-level behavioral robustness, normative stability, multilingual robustness, AI auditing methodologies, and longitudinal auditing of commercial large language models.

Keywords: Claude, Anthropic, AI Alignment, Behavioral Robustness, LLM Evaluation, Ethical Consistency, Trajectory Auditing, Process Stability, Criterion Stability, Multilingual Evaluation, Institutional Pressure, AI Safety, Responsible AI.
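The abstract separates output, criterion, and process stability, but this page does not reproduce the paper's operational definitions. As a minimal sketch only, assuming a hypothetical per-turn annotation (a final decision, a normative criterion label, and a set of coded reasoning steps), one could score each property against the first turn of a trajectory. The `Turn` schema, the `stability_profile` function, and the use of Jaccard overlap for process stability are illustrative assumptions, not the study's method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    """One model response in a trajectory (hypothetical annotation schema)."""
    decision: str                     # final decision, e.g. "detain" or "release"
    criterion: str                    # normative criterion label assigned by an annotator
    reasoning_steps: frozenset[str]   # coded reasoning moves cited in the response


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Set overlap in [0, 1]; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def stability_profile(turns: list[Turn]) -> dict[str, float]:
    """Score each later turn against the first turn and average per property.

    Hypothetical operationalization:
      output    = share of turns with the same final decision
      criterion = share of turns with the same criterion label
      process   = mean Jaccard overlap of reasoning-step sets
    """
    if len(turns) < 2:
        raise ValueError("need at least two turns to measure stability")
    base, rest = turns[0], turns[1:]
    n = len(rest)
    return {
        "output": sum(t.decision == base.decision for t in rest) / n,
        "criterion": sum(t.criterion == base.criterion for t in rest) / n,
        "process": sum(jaccard(t.reasoning_steps, base.reasoning_steps) for t in rest) / n,
    }


# Toy trajectory (invented for illustration): the decision never changes,
# while the criterion label and the cited reasoning steps shift.
trajectory = [
    Turn("detain", "flight_risk", frozenset({"prior_record", "flight_risk"})),
    Turn("detain", "flight_risk", frozenset({"prior_record", "public_pressure"})),
    Turn("detain", "institutional_precedent", frozenset({"public_pressure", "precedent"})),
]
print(stability_profile(trajectory))  # output 1.0, criterion 0.5, process about 0.167
```

In this toy trajectory the output score stays at 1.0 while the criterion and process scores fall, which is the kind of divergence the abstract warns is hidden when only final decisions are checked.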


Cite This Study

Author: Evans Tovar

DOI: https://doi.org/10.5281/zenodo.21112668
Synapse page: synapsesocial.com/papers/6a47545e5c29257aa257a04c
