While Vision-Language Models (VLMs) have achieved remarkable success on tasks involving natural RGB images, their ability to understand non-RGB sensor data, including thermal, depth, hyperspectral, and X-ray imagery, remains severely limited. This limitation stems from an entrenched RGB-centric bias: current VLMs treat these distinct modalities as ordinary photographs and thus fail to account for their unique physical properties. To systematically evaluate and address this issue, we present CausalSense, a novel benchmark suite designed to expose RGB-centric bias in large-scale VLMs using non-RGB sensor data. We also devise a causal learning framework to alleviate this bias. Our approach employs confounder dictionaries and backdoor adjustment from causal inference to integrate sensor-specific knowledge into VLMs, circumventing the need for extensive retraining on massive datasets. Our evaluations on CausalSense reveal a significant performance deficiency in state-of-the-art VLMs on non-RGB sensor comprehension. Crucially, we demonstrate that our proposed causal deconfounded cross-modal encoder substantially improves VLMs' ability to reason about the physical attributes captured by these modalities, measurably reducing the observed performance gap. Together, the benchmark and framework pave the way for more resilient, sensor-aware vision-language models capable of robustly interpreting diverse real-world phenomena beyond the visible spectrum.
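For reference, the backdoor adjustment mentioned above has a standard form in causal inference. The sketch below states it in generic notation. The symbols are illustrative assumptions and are not taken from the paper: X stands for the sensor input, Y for the model's answer, and z for one confounder value drawn from a finite set Z.

```latex
% Standard backdoor adjustment (generic notation, assumed for illustration):
% the interventional distribution is obtained by averaging the conditional
% over the confounder values z, weighted by their prior probability.
\[
  P\bigl(Y \mid \mathrm{do}(X)\bigr) \;=\; \sum_{z \in Z} P\bigl(Y \mid X, z\bigr)\, P(z)
\]
```

Under this reading, a confounder dictionary would plausibly supply the finite set Z over which the sum is taken, though the abstract does not specify the exact construction.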
Yu et al. studied this question.