Vision-Language Models (VLMs) are increasingly adopted in radiology for tasks ranging from automated image interpretation to report generation and visual question answering (VQA). Yet these models have a well-documented tendency to produce clinically unfaithful outputs, commonly referred to as hallucinations, which raise serious patient safety concerns in diagnostic settings. Although hallucination detection has attracted growing interest in the broader computer vision community, the specific problem of spatially grounded hallucination in medical imaging has received comparatively little attention. This paper addresses that gap. We present what is, to our knowledge, the first systematic analysis of spatial hallucination phenomena in radiology VQA. We introduce a four-tier taxonomy of medical spatial hallucinations, comprising Existence Fabrication, Anatomical Mislocalization, Spatial Relationship Distortion, and Volumetric Reasoning Failure, each grounded in clinical radiology practice. We analyze 14 VLMs across six radiology VQA benchmarks and evaluate their spatial grounding fidelity using established metrics alongside a new composite metric, the Spatially-Grounded Hallucination Index (SGHI). Our findings indicate that spatial hallucination rates on radiology tasks range from 23.7% to 41.2%, substantially exceeding the 12.4%–18.9% observed on natural image benchmarks. We also review mitigation strategies and lay out a research roadmap toward clinically trustworthy, spatially faithful multimodal medical AI.
Pratul Mishra studied this question.