What question did this study set out to answer?

This research aims to assess the vulnerability of LLaVA-1.5 to adversarial attacks using targeted visual prompts.

May 10, 2026Open Access

Quantifying modality imbalance and visual jailbreak robustness in LLaVA via projected gradient descent

Key Points

This research aims to assess the vulnerability of LLaVA-1.5 to adversarial attacks using targeted visual prompts.
Applied Projected Gradient Descent (PGD) attack on 1000 samples from the MM-SafetyBench dataset.
Evaluated high risk categories to quantify attack effectiveness.
Implemented strict metrics for compliance to eliminate false positives.
Achieved Attack Success Rates (ASR) of 95% to 100% at perturbation budgets of ε ≥ 8/255 across all categories.
Adversarial visual embeddings successfully overpowered textual safety constraints.
Findings indicate a significant modality gap where visual inputs can subvert safety mechanisms.

Abstract

While Large Vision Language Models (LVLMs) exhibit remarkable capabilities, their visual modality introduces a critical attack surface that can bypass text only safety alignments. This paper evaluates the vulnerability of LLaVA-1. 5 to targeted adversarial visual prompts designed to induce malicious compliance. Using a Projected Gradient Descent (PGD) attack on the MM-SafetyBench dataset, we evaluate 1000 samples across five high risk categories. To eliminate false positives caused by superficial compliance, we apply a rigorous metric that strictly demands sustained, direct compliance without late stage refusals. Our results demonstrate that imperceptible visual perturbations effectively hijack safety guardrails, achieving Attack Success Rates (ASR) of 95% to 100% across all categories at perturbation budgets of 8/255. Furthermore, analysis of the Modality Gap (₌₆) reveals that adversarial visual embeddings overpower textual safety constraints, forcing a malicious multimodal alignment. These findings underscore the inadequacy of current unimodal safety fine tuning and highlight the urgent need for robust, multimodal specific defense mechanisms.

Mark Helpful

Bookmark

Relay

View Full Paper