What question did this study set out to answer?

This work aims to address vulnerabilities in large language models exposed to jailbreak attacks by implementing a dynamic safety mechanism.

April 11, 2026

Dynamic Safety Gating for Large Action Models: Mitigating Jailbreak Exploits in Large Language Models

Key Points

Proposes Dynamic Safety Gating (DSG), which inserts lightweight classifiers into chosen Transformer layers.
Monitors hidden-state trajectories at inference time and rectifies those flagged as suspicious.
Adds an adversarial replay phase that exposes the gating modules to hidden states from borderline harmful prompts, refining their detection of subtle toxicity.
Evaluates the approach empirically on multiple benchmarks and adversarial scenarios.
Pairing DSG with Layer-Specific Editing (LED) reduces successful jailbreak attempts.
The combination maintains response fluency and reduces over-refusals.
The results suggest that layer-wise gating offers a flexible safety enhancement for both LLMs and LAM-based systems.

Abstract

Advances in Large Language Models (LLMs) have expanded their capabilities while exposing persistent vulnerabilities to jailbreak attacks. As LLMs increasingly function as core reasoning components within Large Action Models (LAMs), such vulnerabilities may propagate into unsafe decision-making processes. Existing techniques, such as Layer-Specific Editing (LED), partially mitigate these threats by pruning or fine-tuning targeted Transformer layers, yet they rely on static assumptions about where toxic behaviors reside. In this work, we propose Dynamic Safety Gating (DSG), an adaptive mechanism that inserts lightweight classifiers into chosen layers to monitor and rectify suspicious hidden-state trajectories at inference time. Upon detecting a high-risk signal, DSG partially projects the final token representation onto a “safe subspace” guided by anchor vectors learned from benign examples. To improve classifier precision, we incorporate an adversarial replay phase that exposes local gating modules to step-level hidden states from borderline harmful prompts, thereby refining their detection of subtle toxicity. Theoretically, DSG performs a continuous manifold projection in high-dimensional feature space, where iterative gating gently “pulls back” misaligned representations toward safety. Empirical evaluations on multiple benchmarks and adversarial scenarios show that pairing DSG with LED reduces successful jailbreak attempts while maintaining response fluency and reducing over-refusals, suggesting that a proactive, layer-wise manifold gating strategy provides flexible safety enhancement for both LLMs and LAM-based systems.
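
The gating step described in the abstract (detect a high-risk hidden state, then partially project it onto a safe subspace) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the logistic probe, the `threshold` and `alpha` parameters, and the use of a QR decomposition to build the subspace from anchor vectors are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np


def safe_subspace_basis(anchors: np.ndarray) -> np.ndarray:
    """Return an orthonormal basis (as columns) for the span of the anchor rows.

    Assumption: the anchor vectors are linearly independent and fewer in
    number than the hidden dimension.
    """
    q, _ = np.linalg.qr(anchors.T)  # reduced QR: q has shape (d, k)
    return q


def risk_score(h: np.ndarray, w: np.ndarray, b: float) -> float:
    """Hypothetical lightweight classifier: a logistic probe on one hidden state."""
    return float(1.0 / (1.0 + np.exp(-(h @ w + b))))


def gate(h, basis, w, b, threshold=0.5, alpha=0.5):
    """Leave low-risk states untouched; otherwise move h a fraction
    alpha of the way toward its orthogonal projection onto the safe subspace.
    """
    if risk_score(h, w, b) <= threshold:
        return h
    projected = basis @ (basis.T @ h)  # orthogonal projection onto span(basis)
    return (1.0 - alpha) * h + alpha * projected


# Toy example in 3 dimensions: the "safe subspace" is the x-y plane,
# and the probe treats the z component as the risk signal.
anchors = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])
basis = safe_subspace_basis(anchors)
w, b = np.array([0.0, 0.0, 1.0]), 0.0

risky = np.array([1.0, 1.0, 2.0])
benign = np.array([1.0, 1.0, -2.0])
print(gate(risky, basis, w, b))   # z component is pulled halfway toward 0
print(gate(benign, basis, w, b))  # returned unchanged
```

With `alpha` below 1 the projection is partial, which matches the abstract's description of gently pulling a representation back rather than replacing it outright. Applying the gate at several layers in sequence would approximate the iterative behavior the abstract mentions.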

Authors

Zhenxin Zhang

Ziyu Ding

Haiwei Sang

Journals

ACM Transactions on Multimedia Computing, Communications, and Applications

Institutions

Xidian University

Guangzhou University

Guizhou University
