February 28, 2024 | Open Access

Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction

Abstract

In recent years, large language models (LLMs) have demonstrated notable success across various tasks, but their trustworthiness remains an open problem. One specific threat is their potential to generate toxic or harmful responses: attackers can craft adversarial prompts that induce such responses. In this work, we pioneer a theoretical foundation for LLM security by identifying bias vulnerabilities within safety fine-tuning, and we design a black-box jailbreak method named DRA (Disguise and Reconstruction Attack), which conceals harmful instructions through disguise and prompts the model to reconstruct the original harmful instruction within its completion. We evaluate DRA across various open-source and closed-source models, showcasing state-of-the-art jailbreak success rates and attack efficiency. Notably, DRA achieves a 90% attack success rate on the GPT-4 chatbot.

Cite This Study

Liu et al. studied this question.

synapsesocial.com/papers/68e77226b6db6435876e7a03 https://doi.org/10.48550/arxiv.2402.18104
