What question did this study set out to answer?

The research aims to investigate the effectiveness of In-Context Learning in enhancing the safety alignment of language models against malicious exploitation.

February 5, 2026

Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations

Puntos clave

The research aims to investigate the effectiveness of In-Context Learning in enhancing the safety alignment of language models against malicious exploitation.
Proposed In-Context Attack (ICA) and In-Context Defense (ICD) strategies.
Utilized minimal in-context demonstrations to manipulate safety responses.
Conducted empirical validations across various models, datasets, and attack scenarios.
Demonstrated effectiveness of ICA and ICD in altering safety alignment of LLM outputs.
Showed scalability of these methods for real-world deployment and red-teaming evaluations.
Provided theoretical insights supporting the manipulation of safety through in-context demonstrations.

Resumen

Large Language Models (LLMs) have demonstrated remarkable success across diverse applications, yet their susceptibility to malicious exploitation remains a critical challenge. Notably, LLMs are known to be vulnerable to jailbreaking attacks, where adversaries craft malicious inputs to induce harmful or unethical outputs. In this paper, motivated by the unique effectiveness and scalability of In-Context Learning (ICL) in LLMs, we explore its potential to modulate the safety alignment of LLMs. Specifically, we propose the In-Context Attack (ICA), which employs harmful demonstrations to subvert LLMs' safety, and the In-Context Defense (ICD), which bolsters their resilience through examples that demonstrate refusal to produce harmful responses. By adjusting the distribution of safety in LLM outputs through adversarial demonstrations, our proposed in-context attack and defense facilitate effective manipulation of their alignment. We first provide theoretical insights to illustrate how minimal in-context demonstrations can efficiently alter safety alignment. Empirically, we validate ICA and ICD across multiple models, datasets, and attack baselines, showing their efficacy and scalability for red-teaming evaluations and robust safeguards for real-world deployment. Overall, our work unveils the pivotal yet understudied role of ICL in LLM safety, opening new avenues for understanding and improving them.

Preguntar a la IA

Me gusta

Guardar