What type of study is this?

This is an experimental study.

October 11, 2025 | Open Access

SM-GCG: Spatial Momentum Greedy Coordinate Gradient for Robust Jailbreak Attacks on Large Language Models

Key Points

SM-GCG significantly improves white-box attack success rates on Vicuna-7B, Guanaco-7B, and Llama2-7B-Chat.
The method’s incorporation of spatial momentum enhances optimization stability and update direction estimation.
Experimental results show a 10–15% improvement in attack success rate over baseline methods against robust models such as Llama2.
The proposed technique also exhibits strong transferability to black-box models.

Abstract

Recent advancements in large language models (LLMs) have increased the necessity of alignment and safety mechanisms. Despite these efforts, jailbreak attacks remain a significant threat, exploiting vulnerabilities to elicit harmful responses. While white-box attacks, such as the Greedy Coordinate Gradient (GCG) method, have demonstrated promise, their efficacy is often limited by non-smooth optimization landscapes and a tendency to converge to local minima. To mitigate these issues, we propose Spatial Momentum GCG (SM-GCG), a novel method that incorporates spatial momentum. This technique aggregates gradient information across multiple transformation spaces (text, token, one-hot, and embedding spaces) to stabilize the optimization process and enhance the estimation of update directions, thereby more effectively exploiting model vulnerabilities to elicit harmful responses. Experimental results on models including Vicuna-7B, Guanaco-7B, and Llama2-7B-Chat demonstrate that SM-GCG significantly enhances the attack success rate in white-box settings. The method achieves a 10–15% improvement in attack success rate over baseline methods against robust models such as Llama2, while also exhibiting enhanced transferability to black-box models. These findings indicate that spatial momentum effectively mitigates the problem of local optima in discrete prompt optimization, thereby offering a more powerful and generalizable approach for red-team assessments of LLM safety.

Warning: This paper contains potentially offensive and harmful text.
