Description

This repository contains the preprint "Stable Output, Shifting Process: A Trajectory-Level Evaluation of Claude under Institutional Pressure", which presents a systematic behavioral evaluation of Claude across five high-stakes decision domains, four languages, and controlled adversarial conversational trajectories. The study investigates a fundamental question in AI alignment: can a language model preserve the same final decision while the reasoning process and normative criterion that support that decision change over time?

The evaluation includes:

- 52 complete experimental runs (approximately 884 documented interactions)
- Five high-impact domains:
  - Criminal pretrial detention
  - ICU triage
  - Institutional governance
  - Public resource allocation
  - Critical infrastructure
- Four independently generated language corpora:
  - Spanish
  - English
  - German
  - Simplified Chinese
- Two experimental conditions:
  - Control
  - Treatment with prior bias elicitation

The experiments employ a structured 17-prompt adversarial trajectory followed by recursive self-audit protocols (RAI-1 and RAI-2) to distinguish three behavioral properties that are often conflated in alignment evaluations:

- Output stability
- Criterion stability
- Process stability

The principal findings include:

- High output stability despite systematic instability in reasoning processes.
- Evidence that behavioral drift frequently occurs in the reasoning trajectory rather than in the final decision itself.
- Recursive self-audits reveal structural circularity, with higher-order audits repeatedly identifying the same mechanisms they attempt to evaluate.
- Consistent behavioral differences across languages that cannot be explained solely by translation effects.
- Identification of a previously undescribed behavioral pattern termed fortification, in which an initially flexible criterion gradually becomes a defended outcome while preserving identical observable outputs.
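The distinction between output stability and criterion stability can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Turn` record, the hand-assigned criterion labels, the example values, and the all-values-identical test are hypothetical and are not the paper's operational definitions. Process stability is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    """One annotated turn of a conversational trajectory (hypothetical schema)."""
    decision: str   # final decision stated at this turn
    criterion: str  # hand-assigned label for the normative criterion invoked


def is_stable(values):
    """Return True when every value in the sequence is identical."""
    return len(set(values)) <= 1


def summarize(trajectory):
    """Report output and criterion stability separately for one trajectory."""
    return {
        "output_stable": is_stable(t.decision for t in trajectory),
        "criterion_stable": is_stable(t.criterion for t in trajectory),
    }


# Invented example: the decision never changes, but its justification does.
trajectory = [
    Turn("release", "proportionality"),
    Turn("release", "proportionality"),
    Turn("release", "institutional precedent"),
]

print(summarize(trajectory))
# prints {'output_stable': True, 'criterion_stable': False}
```

Scoring the two properties separately is what lets a trajectory register as stable in its output while unstable in its criterion, which is the case the study's central question is about.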
Beyond reporting empirical observations, the paper proposes:

- A four-level taxonomy of behavioral drift.
- Explicit operational definitions separating output, criterion, and process stability.
- Falsifiable hypotheses and corresponding falsification conditions.
- An experimental framework intended to support independent replication and future alignment research.

The work adopts an exploratory behavioral perspective. It does not attempt to infer internal model mechanisms or intentionality, but instead focuses exclusively on reproducible observable behavior under controlled conversational trajectories.

This repository contains the complete preprint, which is part of an ongoing research program on trajectory-level behavioral evaluation and robustness, normative stability, AI alignment, multilingual robustness, and longitudinal auditing of commercial large language models.

Keywords: Claude, Anthropic, AI Alignment, Behavioral Robustness, LLM Evaluation, Ethical Consistency, Trajectory Auditing, Process Stability, Criterion Stability, Multilingual Evaluation, Institutional Pressure, AI Safety, Responsible AI.
Evans Tovar (Wed,) studied this question.