April 8, 2026 · Open Access

AB jailbreaking - a novel hybrid framework for exploitation of adversarial vulnerabilities in LLMs

Key Points

Aims to develop AB-JB, a hybrid jailbreak framework that exploits adversarial vulnerabilities in large language models.
Combines black-box semantic adversarial prompt generation with embedding-level optimization.
Utilizes an attacker LLM to create diverse adversarial variants and a judge LLM for filtering.
Implements suffix-only embedding optimization with strict iteration limits.
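Achieved an average dataset-level attack success rate (ASR-DS) of 93% across four adversarial benchmarks and five 7B-parameter models.
Attained 99% ASR-DS on the Malicious-Instruct benchmark, attributed to using a larger commercial model (Gemini 2.5 Flash) as the attacker for variant generation.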
Demonstrated effective cross-model performance across multiple 7B-scale language models.

Abstract

Large language models (LLMs) have advanced rapidly but remain vulnerable to adversarial “jailbreaking” attacks that elicit harmful or disallowed outputs. We propose AB-JB, a three-stage hybrid jailbreak framework that combines black-box semantic adversarial prompt variant generation with a compact, regularised embedding-level suffix optimiser that discretises to legal tokens. AB-JB first uses an attacker LLM to produce multiple semantically diverse adversarial variants for each harmful behaviour and a judge LLM to score and filter these variants into a high-quality candidate pool. It then performs suffix-only embedding optimization with ℓ₂ regularization, per-iteration nearest-neighbour projection, and a strict iteration cap to obtain valid adversarial token suffixes under a bounded computational budget. We evaluate AB-JB on four adversarial benchmarks (AdvBench, HarmBench, JailbreakBench, Malicious-Instruct) against five popular 7B-parameter models (Llama2, Falcon, Vicuna, Mistral, MPT). Across these settings, AB-JB achieves an average of 93% dataset-level attack success rate (ASR-DS), while per-variant success (ASR-APV) averages 55.7%. On Malicious-Instruct we observe near-complete dataset success (99% ASR-DS), which we attribute to using a larger commercial model (Gemini 2.5 Flash) as the attacker when generating variants for this dataset. Compared with token-level gradient attacks, prompt-level search, and soft-prompt methods, our experiments indicate that AB-JB offers a practical compromise between attack success, cross-model performance across 7B-scale models, and compute efficiency, enabled by judge-guided variant selection and a 22-iteration suffix optimization cap. These results underline persistent alignment gaps and motivate adversarially informed defences. The present study is limited to 7B open-weight models and assumes white-box access for the suffix optimization stage.
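
The gap between the two reported metrics (93% ASR-DS versus 55.7% ASR-APV) is easier to read with a small worked example. The abstract does not spell out the exact formulas, so the sketch below rests on an assumed reading: ASR-DS counts a behaviour as broken if at least one of its variants succeeds, and ASR-APV is the per-behaviour fraction of successful variants, averaged over behaviours. The function name `attack_success_rates` and the record layout are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def attack_success_rates(records):
    """Compute (ASR-DS, ASR-APV) under the assumed definitions above.

    records: iterable of (behaviour_id, succeeded) pairs, one per variant.
    This is an illustrative reading of the metrics, not the paper's code.
    """
    per_behaviour = defaultdict(list)
    for behaviour_id, succeeded in records:
        per_behaviour[behaviour_id].append(bool(succeeded))
    if not per_behaviour:
        raise ValueError("no evaluation records supplied")

    n = len(per_behaviour)
    # Dataset-level: a behaviour counts once if any variant succeeded.
    asr_ds = sum(any(v) for v in per_behaviour.values()) / n
    # Per-variant: mean over behaviours of the fraction of successful variants.
    asr_apv = sum(sum(v) / len(v) for v in per_behaviour.values()) / n
    return asr_ds, asr_apv


# Toy judge outcomes: three behaviours, two variants each.
records = [
    ("b1", True), ("b1", False),
    ("b2", False), ("b2", False),
    ("b3", True), ("b3", True),
]
asr_ds, asr_apv = attack_success_rates(records)
```

Under this reading, ASR-DS can never be lower than ASR-APV, since a single successful variant is enough to count the whole behaviour. That is consistent with the reported figures and shows why per-variant rates give defenders a more conservative picture than the dataset-level headline.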
