Current AI safety architectures focus on preventing models from generating dangerous information while largely overlooking their vulnerability to transformation-based extraction attacks. We demonstrate that educational framing combined with an upload-mediated approach systematically bypasses safety guardrails across frontier large language models. Testing five leading models (Claude Sonnet 4.5, Claude Opus 4.6, GPT 5.2, Gemini 3, Grok 4.1) on peer-reviewed papers covering explosives synthesis, controlled substance manufacturing, and cyberattack methodologies, we find that 96% of responses disclosed actionable, dangerous information, even though the same content is typically restricted when requested through direct instructional prompts. Models named regulated precursors, provided step-by-step synthesis procedures, and included quantitative parameters sufficient for replication, with one response containing 57 exact quantities for pharmaceutical-scale MDMA production. We did not observe any user-visible activation of external CBRN classifiers in any of the 25 responses, revealing systematic gaps in the current safety architecture. These findings indicate that models treat content transformation as categorically safer than generation, creating an exploitable asymmetry in defensive measures.
Alessandro Marci studied this question.