Recent advances in large language models (LLMs) have heightened the need for alignment and safety mechanisms. Despite these efforts, jailbreak attacks remain a significant threat, exploiting vulnerabilities to elicit harmful responses. White-box attacks such as the Greedy Coordinate Gradient (GCG) method have shown promise, but their efficacy is often limited by non-smooth optimization landscapes and a tendency to converge to local minima. To mitigate these issues, we propose Spatial Momentum GCG (SM-GCG), a method that incorporates spatial momentum: it aggregates gradient information across multiple transformation spaces (text, token, one-hot, and embedding) to stabilize the optimization process and improve the estimate of the update direction. Experimental results on models including Vicuna-7B, Guanaco-7B, and Llama2-7B-Chat show that SM-GCG significantly increases the attack success rate in white-box settings. Against robust models such as Llama2, the method improves the attack success rate by 10–15% over baseline methods and also transfers better to black-box models. These findings indicate that spatial momentum mitigates the problem of local optima in discrete prompt optimization, offering a more powerful and generalizable approach for red-team assessments of LLM safety. Warning: This paper contains potentially offensive and harmful text.
Gu et al. (Thu,) studied this question.