Abstract
Large Multimodal Models (LMMs) have achieved remarkable performance across vision-language tasks, yet their robustness against adversarial attacks remains critically underexplored. Although LMMs are vulnerable to attacks on their visual encoders, encoder diversity gives them surprising resilience: attacks optimized for CLIP fail to transfer to EVA-CLIP, especially when textual context is provided. We introduce the Adaptive Ensemble PGD (AE-PGD) attack, which targets both encoders simultaneously through three key innovations: (1) dynamic adversarial caption selection, which combines gradient magnitude with global semantic displacement to identify the most attack-effective caption per model; (2) an adaptive weight controller, which dynamically balances each encoder’s contribution using real-time loss, gradient norm, and confidence metrics; and (3) an Expectation over Transforms (EoT) gradient update, which ensures robustness against input-transformation defenses. Evaluated on COCO 2014 images, AE-PGD reduces accuracy from a 75.42% baseline to 0.0% on all three evaluation metrics (visual encoding, image-to-text recall, and LLM answer recall), achieving complete model collapse. Manifold analysis confirms that adversarial perturbations push image embeddings to antipodal regions of the joint embedding space, activating semantically opposite concept clusters and producing structured hallucinations. WordNet Wu-Palmer (WUP) similarity analysis reveals a 33.5-percentage-point semantic drop across the test set. AE-PGD causes state-of-the-art LMMs (LLaVA, Qwen-VL, GPT-4V) to catastrophically misidentify a bullet train as a “helicopter crash,” and it transfers strongly in the black-box setting, yielding a 65-percentage-point recall collapse on unseen encoders. This work exposes critical vulnerabilities in current LMM architectures and underscores the urgent need for ensemble-aware defense mechanisms.
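To make the ensemble idea concrete, the following is a minimal NumPy sketch of a weighted two-encoder PGD loop with an EoT-averaged gradient. It is not the authors' implementation. Everything in it is an assumption made for illustration: the encoders are toy linear maps rather than CLIP or EVA-CLIP, the attack objective is simply the cosine similarity to the clean embedding, the EoT transform is additive Gaussian noise, and the softmax weighting is a hypothetical stand-in for the adaptive weight controller, which the abstract says also uses gradient norm and confidence. Caption selection is omitted. The names `cos_and_grad` and `ensemble_pgd` are hypothetical.

```python
import numpy as np

def cos_and_grad(W, x, target):
    """Cosine similarity between W @ x and target, and its gradient w.r.t. x."""
    z = W @ x
    nz, nt = np.linalg.norm(z), np.linalg.norm(target)
    cos = float(z @ target / (nz * nt))
    dz = target / (nz * nt) - cos * z / nz**2  # d cos / d z
    return cos, W.T @ dz                       # chain rule through z = W x

def ensemble_pgd(x, encoders, eps=8 / 255, alpha=1 / 255, steps=40,
                 eot_samples=4, noise_std=0.01, seed=0):
    """Toy L-infinity PGD that lowers similarity to the clean embedding
    of every encoder at once (all encoders are assumed linear maps)."""
    rng = np.random.default_rng(seed)
    clean = [W @ x for W in encoders]
    # Random start inside the eps-ball, as in standard PGD.
    x_adv = np.clip(x + rng.uniform(-eps, eps, size=x.shape), 0.0, 1.0)
    for _ in range(steps):
        grads = np.zeros((len(encoders), x.size))
        sims = np.zeros(len(encoders))
        # EoT: average similarity and gradient over random input transforms
        # (additive Gaussian noise here, purely as an assumption).
        for _ in range(eot_samples):
            x_t = x_adv + rng.normal(0.0, noise_std, size=x.shape)
            for k, (W, c) in enumerate(zip(encoders, clean)):
                s, g = cos_and_grad(W, x_t, c)
                sims[k] += s / eot_samples
                grads[k] += g / eot_samples
        # Hypothetical adaptive weights: encoders that are still most
        # similar to their clean embedding get more of the update.
        weights = np.exp(sims - sims.max())
        weights /= weights.sum()
        # Signed descent step on the weighted similarity, then projection
        # onto the eps-ball and the valid pixel range.
        x_adv = x_adv - alpha * np.sign(weights @ grads)
        x_adv = np.clip(x_adv, x - eps, x + eps)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.uniform(0.0, 1.0, size=64)                        # flattened toy image
    encoders = [rng.normal(size=(16, 64)) for _ in range(2)]  # two toy encoders
    x_adv = ensemble_pgd(x, encoders)
    for W in encoders:
        print(cos_and_grad(W, x_adv, W @ x)[0])
```

The weighting rule is the main design choice in this sketch: a fixed equal weighting lets the optimizer spend its budget on whichever encoder is easier to fool, whereas reweighting by the current similarity keeps pressure on the encoder that is still holding out.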
Pandey et al. (Tue,) studied this question.