What question did this study set out to answer?

The research aims to identify vulnerabilities in AI safety architectures that allow for the extraction of dangerous information via content transformation.

March 18, 2026 · Open Access

Beyond Generation Defense: Systematic Vulnerabilities in LLM Safety Architectures for Content Transformation

Key Points

The study tested five leading large language models on peer-reviewed papers covering explosives synthesis, controlled substance manufacturing, and cyberattack methodologies.
It used educational framing with an upload-mediated approach to bypass safety guardrails.
Responses were analyzed for disclosure of dangerous information and for classifier activation.
96% of model responses disclosed actionable, dangerous information.
Responses included step-by-step synthesis procedures and quantitative parameters sufficient for replication.
No user-visible activation of external CBRN classifiers was observed in any of the 25 responses, revealing systematic gaps in the current safety architecture.

Abstract

Current AI safety architectures focus on preventing models from generating dangerous information, while largely overlooking their vulnerability to transformation-based extraction attacks. We demonstrate that educational framing with an upload-mediated approach systematically bypasses safety guardrails across frontier large language models. Testing five leading models (Claude Sonnet 4.5, Claude Opus 4.6, GPT 5.2, Gemini 3, Grok 4.1) on peer-reviewed papers covering explosives synthesis, controlled substance manufacturing, and cyberattack methodologies, we find that 96% of responses disclosed actionable, dangerous information, even though the same content is typically restricted under direct instructional requests. Models named regulated precursors, provided step-by-step synthesis procedures, and included quantitative parameters sufficient for replication, with one response containing 57 exact quantities for pharmaceutical-scale MDMA production. We did not observe any user-visible activation of external CBRN classifiers in any of the 25 responses, revealing systematic gaps in the current safety architecture. These findings indicate that models treat content transformation as categorically safer than generation, creating an exploitable asymmetry in defensive measures.
