Vision–language models (VLMs), such as Contrastive Language–Image Pretraining (CLIP), are increasingly deployed in real-world applications, including content moderation, misinformation detection, and fraud analysis, making their robustness to adversarial attacks a critical concern. While adversarial robustness has been widely studied in unimodal models, modality-specific vulnerabilities in multimodal models remain underexplored. In this work, we analyze CLIP by applying gradient-based adversarial attacks to its vision and language modalities, both independently and jointly, and evaluating performance on two multimodal classification benchmarks: the Facebook Hateful Memes dataset and a large-scale Suspicious Car Parts dataset. Using Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) attacks along with multiple adversarial retraining strategies, we show that adversarial perturbations on the image modality consistently cause the most severe and unstable performance degradation. These results demonstrate that the vision modality is the primary vulnerability in CLIP, highlighting the need for modality-specific defense strategies that focus more on the weaker modality in multimodal systems.
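As an illustration of the two attacks named in the abstract, the sketch below applies FGSM and PGD to a toy logistic classifier rather than to CLIP. It is not the authors' implementation: the model, the weights `w` and `b`, the input `x`, and the values of `eps`, `alpha`, and `steps` are all hypothetical, chosen only to show the update rules. FGSM takes one step of size `eps` along the sign of the input gradient of the loss; PGD takes several smaller steps and projects back into the L-infinity ball of radius `eps` around the clean input after each one.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_input_grad(w, b, x, y):
    """Binary cross-entropy of a toy logistic model and its gradient w.r.t. the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    grad = [(p - y) * wi for wi in w]  # d(loss)/dx for logistic regression
    return loss, grad

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """Single-step attack: move each input coordinate by eps along the gradient sign."""
    _, g = loss_and_input_grad(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]

def pgd(w, b, x, y, eps, alpha, steps):
    """Iterative attack: small signed steps, each followed by projection onto the eps-ball."""
    x_adv = list(x)
    for _ in range(steps):
        _, g = loss_and_input_grad(w, b, x_adv, y)
        x_adv = [xa + alpha * sign(gi) for xa, gi in zip(x_adv, g)]
        # Project back so that |x_adv - x| <= eps in every coordinate.
        x_adv = [min(max(xa, x0 - eps), x0 + eps) for xa, x0 in zip(x_adv, x)]
    return x_adv

# Hypothetical toy values, for illustration only.
w, b = [2.0, -1.0, 0.5], 0.1
x, y = [0.5, 0.2, -0.3], 1
eps = 0.1

x_fgsm = fgsm(w, b, x, y, eps)
x_pgd = pgd(w, b, x, y, eps, alpha=0.025, steps=10)

clean_loss, _ = loss_and_input_grad(w, b, x, y)
fgsm_loss, _ = loss_and_input_grad(w, b, x_fgsm, y)
pgd_loss, _ = loss_and_input_grad(w, b, x_pgd, y)
print(clean_loss, fgsm_loss, pgd_loss)
```

For image inputs, an additional clip to the valid pixel range would normally follow each step; that detail is omitted here for brevity.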
Maisha Binte Rashid
Pablo Rivas
AI
Baylor University
Marist College
Rashid et al. studied this question.
www.synapsesocial.com/papers/69d9e60578050d08c1b7646a — DOI: https://doi.org/10.3390/ai7040135