Large Language Models (LLMs) have demonstrated remarkable success across diverse applications, yet their susceptibility to malicious exploitation remains a critical challenge. Notably, LLMs are known to be vulnerable to jailbreaking attacks, where adversaries craft malicious inputs to induce harmful or unethical outputs. In this paper, motivated by the unique effectiveness and scalability of In-Context Learning (ICL) in LLMs, we explore its potential to modulate the safety alignment of LLMs. Specifically, we propose the In-Context Attack (ICA), which employs harmful demonstrations to subvert LLMs' safety, and the In-Context Defense (ICD), which bolsters their resilience through examples that demonstrate refusal to produce harmful responses. By adjusting the distribution of safety in LLM outputs through adversarial demonstrations, our proposed in-context attack and defense facilitate effective manipulation of their alignment. We first provide theoretical insights to illustrate how minimal in-context demonstrations can efficiently alter safety alignment. Empirically, we validate ICA and ICD across multiple models, datasets, and attack baselines, showing their efficacy and scalability for red-teaming evaluations and robust safeguards for real-world deployment. Overall, our work unveils the pivotal yet understudied role of ICL in LLM safety, opening new avenues for understanding and improving them.
Zeming Wei
Yue Wang
Ang Li
IEEE Transactions on Pattern Analysis and Machine Intelligence
MIT University
Artificial Intelligence in Medicine (Canada)
Department of Mathematical Sciences
Wei et al. studied this question.
www.synapsesocial.com/papers/698434cff1d9ada3c1fb3605 — DOI: https://doi.org/10.1109/tpami.2026.3660147