The issue of hallucinations is a prevalent concern in existing Large Vision-Language Models (LVLMs). Previous efforts have primarily focused on investigating object hallucinations, which can be easily alleviated by introducing object detectors. However, these efforts neglect hallucinations in inter-object relationships, essential for visual comprehension. In this work, we introduce R-Bench, a novel benchmark specifically designed to evaluate hallucinations in visual relationships. R-Bench includes both image-level questions to assess the existence of relationships and instance-level questions that probe deeper into local visual comprehension. Our analysis reveals that relationship hallucinations arise from three types of co-occurrences: relationship-relationship, subject-relationship, and relationship-object, exacerbated by the long-tail distribution in visual datasets. Moreover, LVLMs often ignore visual content, over-relying on common sense from language models, particularly in spatial reasoning tasks. We further demonstrate that region-level image-text alignment helps mitigate relationship hallucinations and propose a new baseline, Region-Aware Alignment Mitigation (RA2M), that enhances model attention to relevant regions, improving alignment between generated text and images.
Wu et al. (Thu,) studied this question.
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: