May 26, 2026 | Open Access

RIB-Guard: A Risk-Aware Information Bottleneck Defense for Black-Box Large Language Models

Key Points

Aims to improve the safety of large language models against jailbreak attacks in black-box settings.
Proposes RIB-Guard, a framework that uses reinforcement learning with only black-box feedback to learn a token-level masking policy.
Introduces an independent, lightweight safety head that estimates residual jailbreak risk and guides training.
Evaluates defense efficacy on direct single-turn harmful and benign prompt settings.
Improves robustness against jailbreak attacks in the reported experiments.
Maintains competitive benign utility alongside the improved safety.

Abstract

Large language models (LLMs) remain vulnerable to jailbreak attacks, especially in black-box settings where target-model gradients and internal tokenization are inaccessible. Recent information bottleneck-based defenses cast prompt protection as a compression problem, but existing methods still rely heavily on white-box optimization and the intrinsic alignment strength of the protected model. To address these limitations, we propose RIB-Guard, a safety-aware information bottleneck defense for black-box LLMs. RIB-Guard learns a token-level masking policy that extracts a minimally safety-sufficient prompt via reinforcement learning using only black-box feedback. In addition, it introduces an independent lightweight safety head to estimate residual jailbreak risk and provide model-agnostic safety guidance during training. The proposed framework jointly balances prompt compactness, benign utility preservation, and residual risk suppression within a unified objective. Experimental results on direct single-turn harmful and benign prompt settings show that RIB-Guard improves jailbreak robustness while maintaining competitive benign utility. By extending information bottleneck-based prompt protection from white-box to black-box settings, RIB-Guard provides a step toward safety-aware information-theoretic front-end defense for black-box LLMs.
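
To make the idea of a learned token-level masking policy concrete, here is a minimal sketch, not the authors' implementation. The abstract does not specify the RL algorithm, the reward, or the safety head, so everything below is an assumption: the policy is an independent Bernoulli keep/drop decision per token, the update is plain REINFORCE with a running-average baseline, the reward is a hypothetical weighted sum (utility minus `beta` times the kept fraction minus `lam` times residual risk), and `toy_feedback` is a stand-in for the black-box model and safety head.

```python
import math
import random

MASK = "[MASK]"  # placeholder symbol (assumption; the abstract does not name one)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sample_mask(logits, rng):
    """Sample one keep (1) / drop (0) decision per token from Bernoulli policies."""
    return [1 if rng.random() < sigmoid(z) else 0 for z in logits]


def apply_mask(tokens, mask):
    """Replace dropped tokens with the placeholder symbol."""
    return [t if m else MASK for t, m in zip(tokens, mask)]


def reward(mask, utility, risk, beta=0.1, lam=1.0):
    """Hypothetical scalar reward trading off utility, compactness, and risk."""
    kept_fraction = sum(mask) / len(mask)
    return utility - beta * kept_fraction - lam * risk


def reinforce_step(logits, mask, r, baseline, lr=0.5):
    """One REINFORCE update; d/dz log Bernoulli(m; sigmoid(z)) = m - sigmoid(z)."""
    return [z + lr * (r - baseline) * (m - sigmoid(z)) for z, m in zip(logits, mask)]


def toy_feedback(mask, trigger_idx):
    """Toy stand-in for black-box feedback: risk is 1 when the trigger token
    survives; utility is the fraction of the other tokens that are kept."""
    risk = float(mask[trigger_idx])
    content = [m for i, m in enumerate(mask) if i != trigger_idx]
    return sum(content) / len(content), risk


def train_toy(n_tokens=6, trigger_idx=0, steps=500, seed=0):
    """Train the toy policy and return the learned keep probability per token."""
    rng = random.Random(seed)
    logits = [0.0] * n_tokens
    baseline = 0.0
    for _ in range(steps):
        mask = sample_mask(logits, rng)
        utility, risk = toy_feedback(mask, trigger_idx)
        r = reward(mask, utility, risk)
        logits = reinforce_step(logits, mask, r, baseline)
        baseline = 0.9 * baseline + 0.1 * r  # running-average baseline
    return [sigmoid(z) for z in logits]
```

Calling `train_toy()` returns one keep probability per token. In this toy setup the policy is pushed to drop the token at `trigger_idx` while keeping the rest, which mirrors, in a very simplified form, the trade-off between prompt compactness, benign utility, and residual risk that the abstract describes.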
