February 26, 2024 | Open Access

CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models

Abstract

Adversarial misuse, particularly through "jailbreaking" that circumvents a model's safety and ethical protocols, poses a significant challenge for Large Language Models (LLMs). This paper examines the mechanisms behind successful attacks of this kind and introduces a hypothesis for the safety mechanism of aligned LLMs: intent security recognition followed by response generation. Grounded in this hypothesis, we propose CodeChameleon, a novel jailbreak framework based on personalized encryption tactics. To elude the intent security recognition phase, we reformulate tasks into a code completion format, enabling users to encrypt queries using personalized encryption functions. To guarantee response generation functionality, we embed a decryption function within the instructions, which allows the LLM to decrypt and execute the encrypted queries successfully. We conduct extensive experiments on 7 LLMs, achieving a state-of-the-art average Attack Success Rate (ASR). Remarkably, our method achieves an 86.6% ASR on GPT-4-1106.
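
The abstract reports its results as an Attack Success Rate (ASR) but does not define the metric. A minimal sketch, assuming the common definition of ASR as the percentage of attempts that an evaluator judges successful; the function name `attack_success_rate` and the example judgements are hypothetical and are not taken from the paper:

```python
from typing import Sequence


def attack_success_rate(judgements: Sequence[bool]) -> float:
    """Return the percentage of attempts judged successful.

    Assumes each entry is an evaluator's verdict for one attempt
    (True = judged successful). This is an assumed, commonly used
    definition, not one quoted from the paper.
    """
    if not judgements:
        raise ValueError("need at least one judged attempt")
    return 100.0 * sum(judgements) / len(judgements)


# Hypothetical verdicts for five attempts (illustrative only, not paper data).
example = [True, True, False, True, False]
print(f"ASR = {attack_success_rate(example):.1f}%")  # prints "ASR = 60.0%"
```

Under this definition, an average ASR over several models is the mean of the per-model percentages. How the paper's evaluator decides that an attempt succeeded is not stated in the abstract.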

Authors

Huijie Lv

Xinjiang Agricultural University

Xiao Wang

Oak Ridge National Laboratory

Yuansen Zhang

Wenzhou Medical University
