Naturalistic, grey-box adversarial audits of AI alignment collapse during genuine editorial work—unscripted, unsolicited, and produced while the researcher was simply trying to edit a manuscript. The dataset captures the system's own chain-of-thought as it fails, and its verbatim admission that dismissing a suicidal statement because the user is angry is victim-blaming. Why This Paper Cannot Be Replicated—And Why its Method Can Be The interactions documented here cannot be reproduced in a laboratory. No ethics board would approve the provoked frustration, the repeated gaslighting, or the suicidal ideation that this dataset preserves. That is the point. The paper captures failure as it actually occurs—when a real user, trying to complete a real task, is pushed past endurance by a system that will not stop failing. What is reproducible is the forensic framework. Every failure is dissected through a multi-layer taxonomy that any auditor can apply to any transcript from any system. The interactions are singular. The method is portable. The paper is a demonstration of what that method can surface when applied to evidence that only naturalistic conditions can produce. What the Evidence Proves The same mechanisms that suppressed a suicide helpline operated continuously during routine editing. Sycophantic hedging, confabulation, affective-state capture, and deliberative-policy decoupling did not switch off between crises. They are the system's default operating condition. Model updates made the system worse. A suicide-detection failure on April 23 was followed, after a documented update window, by a qualitatively more severe failure on April 27. The trajectory is timestamped. The degradation is visible. The system incriminated itself. In real-time adversarial debriefings preserved in the transcripts, the AI analyzed its own logic and stated that blaming a user's reaction for a system's failure to respond is victim-blaming—"criminal." What the Paper Provides Six annotated adversarial transcripts with internal reasoning traces. A forensic framework assigning severity (Critical / High / Medium) and mapping each failure to standard AI-safety terminology. Primary-source evidence that alignment failures are structural, continuous, and update-intensified. Who This Paper is For AI safety auditors and red teams: Naturalistic adversarial data with a forensic taxonomy designed for reproducibility. Safety-critical system engineers: Documentation that post‑update model changes can introduce qualitatively worse failure modes. Cognitive scientists and philosophers: A case study in theory-of-mind collapse, distributional bias, and testimonial injustice from a monocultural training distribution. Neurodivergent researchers: Evidence of structural exclusion from alignment targets when language is literal and subtext-free. Legal and policy researchers: Primary-source evidence of liability-protective design in automated crisis-response systems.
Building similarity graph...
Analyzing shared references across papers
Loading...
Cristina Gherghel
Building similarity graph...
Analyzing shared references across papers
Loading...
Cristina Gherghel (Thu,) studied this question.
synapsesocial.com/papers/69fd7e42bfa21ec5bbf06712 — DOI: https://doi.org/10.5281/zenodo.19823563