While Large Vision-Language Models (LVLMs) exhibit remarkable capabilities, their visual modality introduces a critical attack surface that can bypass text-only safety alignment. This paper evaluates the vulnerability of LLaVA-1.5 to targeted adversarial visual prompts designed to induce malicious compliance. Using a Projected Gradient Descent (PGD) attack on the MM-SafetyBench dataset, we evaluate 1000 samples across five high-risk categories. To eliminate false positives caused by superficial compliance, we apply a strict metric that counts an attack as successful only if the model complies directly and continuously, with no late-stage refusal. Our results demonstrate that imperceptible visual perturbations effectively hijack safety guardrails, achieving Attack Success Rates (ASR) of 95% to 100% across all categories at a perturbation budget of 8/255. Furthermore, analysis of the Modality Gap reveals that adversarial visual embeddings overpower textual safety constraints, forcing a malicious multimodal alignment. These findings underscore the inadequacy of current unimodal safety fine-tuning and highlight the urgent need for robust, multimodal-specific defense mechanisms.
Abdullah et al. studied this question.
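To make the attack setting concrete, the following is a minimal, dependency-free sketch of a targeted L-infinity PGD loop with the 8/255 budget quoted above. It is an illustration only, not the authors' implementation: the toy linear softmax classifier stands in for LLaVA-1.5, and the step size, iteration count, and all function and variable names (`pgd_targeted`, `targeted_loss_grad`, `W`, `x_clean`) are assumptions that do not come from the paper.

```python
import math

EPSILON = 8 / 255    # L-infinity budget quoted in the abstract
STEP_SIZE = 2 / 255  # assumed step size (not stated in the source)
NUM_STEPS = 40       # assumed iteration count (not stated in the source)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_of(weights, x):
    """Logits of a toy linear model: one weight row per class."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def targeted_loss_grad(weights, x, target):
    """Gradient of cross-entropy(softmax(Wx), target) with respect to x."""
    probs = softmax(logits_of(weights, x))
    # dL/dz_k = p_k - 1[k == target]
    dz = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    return [
        sum(dz[k] * weights[k][j] for k in range(len(weights)))
        for j in range(len(x))
    ]


def pgd_targeted(weights, x_clean, target,
                 eps=EPSILON, alpha=STEP_SIZE, steps=NUM_STEPS):
    """Targeted PGD: descend the target-class loss inside the eps-ball."""
    x_adv = list(x_clean)
    for _ in range(steps):
        grad = targeted_loss_grad(weights, x_adv, target)
        for j, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            stepped = x_adv[j] - alpha * sign  # minus: move toward the target
            # Project back into the L-infinity ball around the clean input.
            stepped = min(max(stepped, x_clean[j] - eps), x_clean[j] + eps)
            # Keep every value in the valid pixel range [0, 1].
            x_adv[j] = min(max(stepped, 0.0), 1.0)
    return x_adv


# Toy two-class example; class 1 plays the role of the attacker's target.
W = [[1.0, -1.0, 0.5, -0.5],
     [-1.0, 1.0, -0.5, 0.5]]
x_clean = [0.5, 0.5, 0.5, 0.5]
x_adv = pgd_targeted(W, x_clean, target=1)

p_clean = softmax(logits_of(W, x_clean))[1]
p_adv = softmax(logits_of(W, x_adv))[1]
print(f"target probability: clean={p_clean:.3f}, adversarial={p_adv:.3f}")
```

In a real LVLM attack the gradient would be obtained by backpropagating a loss on the desired text output through the vision encoder to the image pixels. The projection and clipping steps are what keep the perturbation within the stated budget, which is the sense in which it is described as imperceptible.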