Advances in Large Language Models (LLMs) have expanded their capabilities while exposing persistent vulnerabilities to jailbreak attacks. As LLMs increasingly function as core reasoning components within Large Action Models (LAMs), such vulnerabilities may propagate into unsafe decision-making processes. Existing techniques, such as Layer-Specific Editing (LED), partially mitigate these threats by pruning or fine-tuning targeted Transformer layers, yet they rely on static assumptions about where toxic behaviors reside. In this work, we propose Dynamic Safety Gating (DSG), an adaptive mechanism that inserts lightweight classifiers into chosen layers to monitor and rectify suspicious hidden-state trajectories at inference time. Upon detecting a high-risk signal, DSG partially projects the final token representation onto a “safe subspace” guided by anchor vectors learned from benign examples. To improve classifier precision, we incorporate an adversarial replay phase that exposes local gating modules to step-level hidden states from borderline harmful prompts, thereby refining their detection of subtle toxicity. Theoretically, DSG performs a continuous manifold projection in high-dimensional feature space, where iterative gating gently “pulls back” misaligned representations toward safety. Empirical evaluations on multiple benchmarks and adversarial scenarios show that pairing DSG with LED reduces successful jailbreak attempts while maintaining response fluency and reducing over-refusals, suggesting that a proactive, layer-wise manifold gating strategy provides flexible safety enhancement for both LLMs and LAM-based systems.
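The gating step described in the abstract (score a hidden state with a lightweight classifier, then partially project it onto a safe subspace spanned by anchor vectors) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the logistic form of the classifier, the Gram-Schmidt construction of the subspace, the blending rule `(1 - alpha) * h + alpha * P(h)`, and all names (`risk_score`, `gate`, `threshold`, `alpha`).

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(anchors, eps=1e-10):
    """Gram-Schmidt over anchor vectors; (near-)dependent anchors are dropped."""
    basis = []
    for a in anchors:
        r = list(a)
        for q in basis:
            c = dot(r, q)
            r = [ri - c * qi for ri, qi in zip(r, q)]
        norm = math.sqrt(dot(r, r))
        if norm > eps:
            basis.append([ri / norm for ri in r])
    return basis

def project(h, basis):
    """Orthogonal projection of h onto the span of an orthonormal basis."""
    out = [0.0] * len(h)
    for q in basis:
        c = dot(h, q)
        out = [o + c * qi for o, qi in zip(out, q)]
    return out

def risk_score(h, w, b):
    """Hypothetical lightweight classifier: logistic regression on h."""
    return 1.0 / (1.0 + math.exp(-(dot(w, h) + b)))

def gate(h, w, b, basis, threshold=0.5, alpha=0.5):
    """Return (possibly corrected hidden state, risk score).

    Below the threshold the state passes through unchanged. Otherwise it is
    blended with its projection, so alpha=1 is a full projection and smaller
    values only pull the state part of the way toward the subspace.
    """
    risk = risk_score(h, w, b)
    if risk < threshold:
        return list(h), risk
    p = project(h, basis)
    return [(1 - alpha) * hi + alpha * pi for hi, pi in zip(h, p)], risk

# Toy 3-d example: the "safe subspace" is the x-y plane, and the
# hypothetical classifier treats a large z-component as risky.
basis = orthonormal_basis([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
w, b = [0.0, 0.0, 1.0], 0.0

risky, _ = gate([1.0, 2.0, 4.0], w, b, basis)    # -> [1.0, 2.0, 2.0]
benign, _ = gate([1.0, 2.0, -4.0], w, b, basis)  # -> [1.0, 2.0, -4.0]
```

Under these assumptions, each flagged pass shrinks the off-subspace component by a factor of `1 - alpha`, which is one simple way to read the abstract's description of iterative gating pulling representations back toward safety.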
Zhenxin Zhang
Ziyu Ding
Haiwei Sang
ACM Transactions on Multimedia Computing Communications and Applications
Xidian University
Guangzhou University
Guizhou University
Zhang et al. (Thu,) studied this question.
www.synapsesocial.com/papers/69d9e5ec78050d08c1b762df — DOI: https://doi.org/10.1145/3803021