What does this research mean for the field?

A self-evolving, multi-agent defense framework utilizing inference-time prompt optimization achieves state-of-the-art robustness against LLM jailbreak attacks without requiring gradient-based model updates. Novelty: ClaimNovelty.METHODOLOGICAL. Consensus alignment: ConsensusAlignment.NEUTRAL.

May 31, 2026Open Access

Adversarial-Test-Driven Multi-Agent LLM Defense: A Self-Evolving Framework via Inference-Time Prompt Optimization

Key Points

Key points are not available for this paper at this time.

Abstract

Large Language Models (LLMs) remain highly susceptible to jailbreak attacks that bypass safety alignments through sophisticated prompt manipulation. While multi-agent defense systems have emerged as a promising countermeasure, existing frameworks predominantly rely on static agent designs, which struggle to adapt to evolving adversarial strategies. To bridge this gap, we propose an Adversarial-Test-Driven Multi-Agent Defense framework that shifts the focus from model-level fine-tuning to system-level optimization. Our framework introduces a closed-loop evolutionary process consisting of an Attack Design agent that probes vulnerabilities with adaptive adversarial prompts, and an Optimization agent that iteratively refines the defense agents’ system prompts based on feedback. This approach enables the defense system to correct reasoning failures at inference time without requiring gradient-based updates to the underlying LLMs. The experimental results demonstrate that our framework achieves a state-of-the-art Attack Success Rate (ASR). The experiments show that the framework improves jailbreak robustness while making the associated safety–utility trade-off explicit.

Read Full Paperexternally

Bookmark

View Full Paper

Cite This Study

Qu et al. (Sun,) studied this question.

synapsesocial.com/papers/6a1fb7bdfc2fd1e49fd4567d https://doi.org/https://doi.org/10.3390/electronics15112365

Bookmark

View Full Paper