What question did this study set out to answer?

The research aims to diagnose and improve reasoning consistency in large vision-language models affected by distractions.

January 22, 2026

Defying Distractions in Multimodal Tasks: A Novel Benchmark for Large Vision-Language Models

Key Points

Introduced the IR-VQA benchmark to evaluate vulnerabilities in models across four paradigms.
Developed Positive Consistency (PC) and Negative Consistency (NC) metrics for assessing reasoning stability.
Implemented the Relevance-Gated Multimodal Routing (RGMR) mechanism to filter distractions at inference time.
Observed significant drops in reasoning consistency in state-of-the-art models on the IR-VQA benchmark.
Finetuning on IR-VQA improved model robustness significantly.
RGMR demonstrated effectiveness in enhancing model performance where standard approaches failed.

Abstract

Large Vision-Language Models (LVLMs) struggle with "multimodal distractibility," where plausible but irrelevant visual or textual inputs cause significant drops in reasoning consistency and lead to unreliable outputs. This paper introduces a comprehensive framework to systematically diagnose, evaluate, and mitigate this critical challenge. We present three core components: the large-scale IR-VQA benchmark to surface these vulnerabilities across four paradigms; novel diagnostic metrics, Positive Consistency (PC) and Negative Consistency (NC), which move beyond standard accuracy to rigorously measure a model's reasoning stability; and the Relevance-Gated Multimodal Routing (RGMR) mechanism, a novel, lightweight module that proactively and dynamically filters distractions at inference time. Our experiments reveal that state-of-the-art models exhibit significant drops in consistency on IR-VQA. We demonstrate that finetuning on IR-VQA and deploying RGMR substantially improve model robustness where standard prompting fails. Our comprehensive analysis of model behaviors under different types of distractions and the underlying reasoning failures provides a clear path forward for developing more reliable multimodal systems.


Cite This Study

Yang et al. (2026) studied this question.

synapsesocial.com/papers/6971bd90642b1836717e23fd https://doi.org/10.1109/tpami.2026.3655641
