What question did this study set out to answer?

This study investigates which modality of a vision-language model is most vulnerable to adversarial attacks, and how attacks on each modality affect the model's performance.

April 11, 2026 | Open Access

Understanding Modality-Specific Vulnerabilities in Vision–Language Models Under Adversarial Attacks

Key Points

Applied gradient-based adversarial attacks to the vision and language modalities of CLIP, both independently and jointly
Evaluated performance on the Facebook Hateful Memes dataset and a large-scale Suspicious Car Parts dataset
Used Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD)
Explored multiple adversarial retraining strategies
Adversarial attacks on the image modality led to the most significant performance declines
Vision modality demonstrated the greatest instability under attack conditions
Findings underscore the need for specialized defense strategies targeting weaker modalities in multimodal systems

Abstract

Vision–language models (VLMs), such as Contrastive Language–Image Pretraining (CLIP), are increasingly deployed in real-world applications, including content moderation, misinformation detection, and fraud analysis, making their robustness to adversarial attacks a critical concern. While adversarial robustness has been widely studied in unimodal models, modality-specific vulnerabilities in multimodal models remain underexplored. In this work, we analyze CLIP by applying gradient-based adversarial attacks to its vision and language modalities, both independently and jointly, and evaluating performance on two multimodal classification benchmarks: the Facebook Hateful Memes dataset and a large-scale Suspicious Car Parts dataset. Using Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) attacks along with multiple adversarial retraining strategies, we show that adversarial perturbations on the image modality consistently cause the most severe and unstable performance degradation. These results demonstrate that the vision modality is the primary vulnerability in CLIP, highlighting the need for modality-specific defense strategies that focus more on the weaker modality in multimodal systems.
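
The two attacks named in the abstract can be illustrated with a minimal sketch. The code below is not the paper's setup: it assumes a toy logistic-regression classifier in place of CLIP, inputs scaled to [0, 1], and an L-infinity perturbation budget, and the names `loss_and_grad`, `fgsm`, and `pgd` as well as all parameter values are hypothetical. It only shows the general shape of FGSM (one signed-gradient step) and PGD (repeated smaller steps, each projected back into the allowed perturbation range).

```python
import numpy as np

def loss_and_grad(x, y, w, b):
    """Binary cross-entropy of a toy logistic model and its gradient w.r.t. the input x.

    Assumption: this linear model stands in for a real network such as CLIP.
    """
    z = float(np.dot(w, x) + b)
    p = 1.0 / (1.0 + np.exp(-z))          # predicted probability of class 1
    tiny = 1e-12                           # avoids log(0)
    loss = -(y * np.log(p + tiny) + (1 - y) * np.log(1 - p + tiny))
    grad_x = (p - y) * w                   # d(loss)/dx for a logistic model
    return loss, grad_x

def fgsm(x, y, w, b, epsilon):
    """FGSM: a single step of size epsilon along the sign of the input gradient."""
    _, grad_x = loss_and_grad(x, y, w, b)
    x_adv = x + epsilon * np.sign(grad_x)
    return np.clip(x_adv, 0.0, 1.0)        # keep the input in its valid range

def pgd(x, y, w, b, epsilon, step_size, num_steps):
    """PGD: iterated signed-gradient steps, projected onto the L-infinity ball around x."""
    x_adv = x.copy()
    for _ in range(num_steps):
        _, grad_x = loss_and_grad(x_adv, y, w, b)
        x_adv = x_adv + step_size * np.sign(grad_x)
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)  # projection step
        x_adv = np.clip(x_adv, 0.0, 1.0)                  # valid input range
    return x_adv

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(0.2, 0.8, size=8)      # hypothetical "image" features
    w = rng.normal(size=8)                 # hypothetical model weights
    b, y = 0.1, 1.0
    print("clean loss:", loss_and_grad(x, y, w, b)[0])
    print("FGSM loss: ", loss_and_grad(fgsm(x, y, w, b, 0.05), y, w, b)[0])
    print("PGD loss:  ", loss_and_grad(pgd(x, y, w, b, 0.05, 0.01, 10), y, w, b)[0])
```

In a multimodal setting the same idea would be applied to the image input, the text input, or both; how the text modality is perturbed in the paper is not specified here, so this sketch covers only a continuous input.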
