Hallucination is a prevalent problem in existing Large Vision-Language Models (LVLMs). Previous efforts have focused primarily on object hallucinations, which can be easily alleviated by introducing object detectors. However, these efforts neglect hallucinations in inter-object relationships, which are essential for visual comprehension. In this work, we introduce R-Bench, a novel benchmark specifically designed to evaluate hallucinations in visual relationships. R-Bench includes both image-level questions, which assess whether a relationship exists, and instance-level questions, which probe deeper into local visual comprehension. Our analysis reveals that relationship hallucinations arise from three types of co-occurrences: relationship-relationship, subject-relationship, and relationship-object, and that they are exacerbated by the long-tail distribution in visual datasets. Moreover, LVLMs often ignore visual content and over-rely on common sense from language models, particularly in spatial reasoning tasks. We further demonstrate that region-level image-text alignment helps mitigate relationship hallucinations, and we propose a new baseline, Region-Aware Alignment Mitigation (RA2M), that enhances model attention to relevant regions and improves alignment between generated text and images.
Mingrui Wu
Jiale Li
Jiayi Ji
IEEE Transactions on Pattern Analysis and Machine Intelligence
National University of Singapore
Xiamen University
Ministry of Education of the People's Republic of China
Wu et al. studied this question.
www.synapsesocial.com/papers/698584f98f7c464f2300839e — DOI: https://doi.org/10.1109/tpami.2026.3656175