Key points are not available for this paper at this time.
Large Language Models (LLMs) remain highly susceptible to jailbreak attacks that bypass safety alignments through sophisticated prompt manipulation. While multi-agent defense systems have emerged as a promising countermeasure, existing frameworks predominantly rely on static agent designs, which struggle to adapt to evolving adversarial strategies. To bridge this gap, we propose an Adversarial-Test-Driven Multi-Agent Defense framework that shifts the focus from model-level fine-tuning to system-level optimization. Our framework introduces a closed-loop evolutionary process consisting of an Attack Design agent that probes vulnerabilities with adaptive adversarial prompts, and an Optimization agent that iteratively refines the defense agents’ system prompts based on feedback. This approach enables the defense system to correct reasoning failures at inference time without requiring gradient-based updates to the underlying LLMs. The experimental results demonstrate that our framework achieves a state-of-the-art Attack Success Rate (ASR). The experiments show that the framework improves jailbreak robustness while making the associated safety–utility trade-off explicit.
Qu et al. (Sun,) studied this question.