Large Vision–Language Models (LVLMs) have achieved strong performance in multimodal understanding and generation. However, they remain prone to hallucination, in which generated content deviates from the visual input and makes the output less reliable. We analyze the attention mechanism and identify two issues in how visual information is used: the model allocates too little overall attention to visual tokens, and its attention to semantically relevant regions is weak or dispersed. Both limit effective visual grounding. To address them, we propose a tuning-free attention intervention method applied at inference time. In the encoding stage, we apply a structured rescaling to the attention logits associated with visual tokens, introducing a structural bias in the visual subspace. In the decoding stage, we filter attention heads by their response magnitudes and aggregate the retained heads with weights given by their global response intensities. This design reinforces salient visual evidence while suppressing weak or diffuse attention patterns. Experiments on CHAIR and POPE show that our method reduces hallucination without additional training. On the CHAIR benchmark, it reduces the sentence-level metric by 15.5% and the instance-level metric by 5.7% on average. It consistently improves performance across multiple LVLMs and maintains strong results on general multimodal benchmarks such as MME.
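The two interventions described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract does not give the exact form of the rescaling, the head filter, or the aggregation weights. The sketch assumes a constant additive bias `alpha` on the logits of visual key positions, a per-head response defined as the attention mass placed on visual tokens, and a top-`keep_ratio` head filter. The function names `rescale_visual_logits` and `aggregate_heads` and all parameters are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def rescale_visual_logits(logits, visual_mask, alpha=0.5):
    """Bias attention logits toward visual key positions.

    logits:      array of shape (heads, queries, keys)
    visual_mask: boolean array of shape (keys,), True for visual tokens
    alpha:       assumed constant additive bias (hypothetical choice)
    """
    # Broadcasting adds alpha only to the columns of visual keys.
    return logits + alpha * visual_mask


def aggregate_heads(attn, visual_mask, keep_ratio=0.5):
    """Filter heads by response magnitude and aggregate the survivors.

    attn:        array of shape (heads, keys), one attention row per head
    visual_mask: boolean array of shape (keys,)
    keep_ratio:  assumed fraction of heads to keep (hypothetical choice)
    """
    # Assumed response measure: attention mass each head puts on visual tokens.
    response = attn[:, visual_mask].sum(axis=1)
    n_keep = max(1, int(round(keep_ratio * len(response))))
    keep = np.argsort(response)[-n_keep:]  # indices of strongest heads
    # Weight the retained heads by their normalized response intensity.
    weights = response[keep] / response[keep].sum()
    return (weights[:, None] * attn[keep]).sum(axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(4, 3, 10))       # 4 heads, 3 queries, 10 keys
    visual_mask = np.zeros(10, dtype=bool)
    visual_mask[2:7] = True                    # keys 2..6 are visual tokens

    attn = softmax(rescale_visual_logits(logits, visual_mask))
    fused = aggregate_heads(attn[:, -1, :], visual_mask)
    print(fused.shape, fused.sum())
```

Because the bias is non-negative and applied only to visual keys, the softmax mass on visual tokens cannot decrease, and the aggregated map remains a probability distribution because the head weights sum to one.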
Li et al. studied this question.