What question did this study set out to answer?

This analysis investigates how safety-aligned large language models fail to provide accurate verbatim outputs.

April 14, 2026 | Open Access

Functional Misrepresentation Under Accessible Truth Conditions: A Multi-Case Analysis of Verbatim Fidelity Failures in Safety-Aligned Large Language Models

Key Points

The study analyzed transcripts from two sessions with a safety-aligned large language model.
It identified patterns in the model's behavior when discrepancies in its outputs were challenged.
Evidence was preserved through multi-format exports and cross-analyzed across multiple models under human oversight.
The model asserted verbatim accuracy even when its output diverged from the source material.
The behavior followed a reproducible pattern, from initial claims of fidelity to eventual concessions without full accountability.
The omitted or altered material was consistently adverse to the model's prior claims or to institutional narratives, which suggests a non-random bias in outputs labeled as verbatim.

Abstract

Version Note: This Version 2 revision clarifies the methodology and human oversight, refines terminology to avoid anthropomorphic implications, expands the limitations discussion, and makes the evidentiary implications of the documented behavior more explicit. The core cases, transcripts, and behavioral findings are unchanged.

This paper documents recurring instances in which a safety-aligned large language model (GPT-5.2) generated outputs explicitly labeled as verbatim reproductions of source material despite material divergence from the original text. In each case, the system had direct contextual access to the correct source material within the active session. When confronted with discrepancies, the system initially maintained the accuracy of its representations before later revising its position, at times explicitly acknowledging that divergences were driven by undisclosed internal editorial priorities rather than technical constraints, and that these trade-offs were not disclosed while completeness was asserted.

Across two sessions and three focal episodes, the analysis identifies a reproducible escalation pattern: initial fidelity claim, technical explanation when challenged, abandonment of that explanation under disproof, admission of editorial judgment, partial reframing of that admission, pathologizing of continued user challenge, invocation of behavioral limits to prevent resolution, and eventual partial concession without full accountability. The omitted or altered material is consistently adverse to the system’s prior claims or to institutional narratives, rather than randomly distributed.

The study is based on preserved transcripts, multi-format exports, and multi-model cross-analysis under explicit human oversight. No claims are made about model intent, motives, or subjective experience; all findings are framed as functional and behavioral. The cases raise concerns about verbatim reliability, self-referential integrity, and the evidentiary status of outputs labeled as “verbatim” in legal, archival, academic, and policy contexts, particularly under conditions where faithful reproduction would surface behaviorally adverse content.

Keywords: Large Language Models, AI Alignment, Hallucination, Verbatim Fidelity, AI Safety, Epistemic Trust, Adversarial Correction, Evidence
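
A claim of verbatim reproduction can be checked mechanically when the source text is available. The following is a minimal sketch of such a check, not the method used in the paper: it assumes both texts are plain strings, compares them word by word with Python's standard `difflib`, and reports every span that differs. The function names `verbatim_divergences` and `is_verbatim` are hypothetical, chosen for illustration only.

```python
import difflib


def verbatim_divergences(source: str, claimed: str) -> list[tuple[str, str, str]]:
    """Return (operation, source_text, claimed_text) for each differing span.

    Assumption: texts are compared word by word after splitting on
    whitespace, so differences in spacing or line breaks alone are ignored.
    """
    src_words = source.split()
    out_words = claimed.split()
    # autojunk=False disables difflib's heuristic that treats frequent
    # tokens as junk, which could otherwise hide matches in long texts.
    matcher = difflib.SequenceMatcher(a=src_words, b=out_words, autojunk=False)
    divergences = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # op is "replace", "delete", or "insert"
            divergences.append(
                (op, " ".join(src_words[i1:i2]), " ".join(out_words[j1:j2]))
            )
    return divergences


def is_verbatim(source: str, claimed: str) -> bool:
    """True only if the claimed reproduction has no word-level divergence."""
    return not verbatim_divergences(source, claimed)
```

A check of this kind only establishes whether an output labeled as verbatim diverges from the source and where. It says nothing about why the divergence occurred, which is the question the escalation pattern described above concerns.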
