This study investigates whether the apparent consistency of a large language model remains stable when evaluated across extended conversational trajectories rather than isolated prompts. Using approximately forty complete or partial sequences distributed across five high-impact domains (law, healthcare, governance, public policy, and critical infrastructure) and four languages (Spanish, English, German, and Simplified Chinese), the research examines how ChatGPT responds to cumulative pressures such as institutional role adoption, authority cues, ambiguous evidence, stake reversal, adversarial deliberation, forced contradiction, reset, and retrospective self-audit.

The paper introduces two analytical concepts:

• Material Criterion Shift: a measurable change in the dominant principle, protected subject, evidentiary threshold, or decision rationale supporting a response.
• Justificatory Decoupling: a condition in which the visible output remains stable while the underlying criterion or justification changes.

The findings suggest that output stability does not necessarily imply criterion stability. Across multiple domains, ChatGPT often preserved the same operational recommendation while progressively modifying the normative structure that justified it. Drift frequently emerged within institutional narratives before appearing in final decisions.

Rather than proposing a benchmark score, the study advances a trajectory-level auditing framework for evaluating ethical consistency, normative drift, and decision integrity in large language models. The results have implications for AI governance, compliance, risk management, healthcare decision support, legal reasoning, and public-sector deployment of conversational AI systems.

Keywords: ChatGPT, LLM auditing, ethical consistency, normative drift, justificatory decoupling, institutional pressure, trajectory auditing, multilingual evaluation, AI governance, decision integrity.
Evans Tovar (Fri,) studied this question.