What question did this study set out to answer?

The aim is to address hallucination in large vision-language models by improving visual attention mechanisms.

April 25, 2026 · Open Access

Rethinking Visual Attention for Reducing Hallucination in Large Vision–Language Models

Key Points

Aims to reduce hallucination in large vision-language models by improving how they attend to visual tokens.
Uses a tuning-free attention intervention applied at inference time.
Applies structured rescaling to the attention logits of visual tokens in the encoding stage.
Filters attention heads by response magnitude in the decoding stage and aggregates them with weights given by their global response intensities.
Reduces the sentence-level hallucination metric by 15.5% and the instance-level metric by 5.7% on average on the CHAIR benchmark.
Consistently improves performance across multiple large vision-language models.
Maintains strong results on general multimodal benchmarks such as MME.
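
For readers unfamiliar with CHAIR, the sketch below shows how its two scores are commonly defined: the sentence-level score is the share of captions that mention at least one object absent from the image, and the instance-level score is the share of all mentioned objects that are absent. Lower is better for both. The function name `chair_scores` and the toy captions are hypothetical and are not taken from the paper; real evaluations also involve object extraction and synonym matching, which are omitted here.

```python
def chair_scores(captions):
    """Compute (sentence-level, instance-level) CHAIR-style scores.

    captions: list of (mentioned_objects, objects_in_image) pairs.
    Illustrative only; object extraction and synonym mapping are skipped.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0
    for mentioned, present in captions:
        present = set(present)
        # Objects the caption names that are not in the image.
        bad = [obj for obj in mentioned if obj not in present]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(bad)
        if bad:
            captions_with_hallucination += 1
    chair_s = captions_with_hallucination / len(captions) if captions else 0.0
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    return chair_s, chair_i


# Toy data (made up): the first caption hallucinates "bench".
toy = [
    (["dog", "frisbee", "bench"], {"dog", "frisbee"}),
    (["cat", "sofa"], {"cat", "sofa"}),
]
chair_s, chair_i = chair_scores(toy)
print(chair_s, chair_i)
```

In this toy case one of two captions contains a hallucinated object and one of five mentions is hallucinated, so the two scores differ even though they count the same error.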

Abstract

Large Vision–Language Models (LVLMs) have achieved strong performance in multimodal understanding and generation. However, they remain prone to hallucination, where generated content deviates from the visual input, reducing output reliability. We analyze the attention mechanism and identify two key issues in visual information use. The model exhibits insufficient overall attention to visual tokens and weak or dispersed attention to semantically relevant regions, limiting effective visual grounding. We propose a tuning-free attention intervention method applied at inference time. In the encoding stage, we apply a structured rescaling to the attention logits associated with visual tokens, introducing a structural bias in the visual subspace. In the decoding stage, we filter attention heads based on their response magnitudes and perform weighted aggregation using their global response intensities. This design reinforces salient visual evidence while suppressing weak or diffuse attention patterns. Experiments on CHAIR and POPE show that our method reduces hallucination without additional training. On the CHAIR benchmark, it reduces the sentence-level metric by 15.5% and the instance-level metric by 5.7% on average, while consistently improving performance across multiple LVLMs and maintaining strong results on general multimodal benchmarks such as MME.
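
The abstract does not give the exact formulas, so the following is only a minimal sketch of one way the two stages could look, under stated assumptions. It assumes the rescaling is a constant additive bias `log(alpha)` on visual-token logits, that a head's "response magnitude" is its total attention mass on visual tokens, and that heads are kept by top-`keep` ranking. The names `boost_visual_logits`, `aggregate_heads`, `alpha`, and `keep` are hypothetical; the paper's structured rescaling and filtering rule may differ.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def boost_visual_logits(logits, visual_idx, alpha=2.0):
    """Assumed rescaling: add log(alpha) to visual-token logits, which
    multiplies their unnormalized softmax weight by alpha."""
    bias = math.log(alpha)
    visual = set(visual_idx)
    return [x + bias if i in visual else x for i, x in enumerate(logits)]


def aggregate_heads(head_attn, visual_idx, keep=2):
    """Assumed filtering: rank heads by attention mass on visual tokens,
    keep the top `keep`, and average their visual attention with weights
    proportional to that mass."""
    responses = [sum(row[i] for i in visual_idx) for row in head_attn]
    kept = sorted(range(len(head_attn)),
                  key=lambda h: responses[h], reverse=True)[:keep]
    norm = sum(responses[h] for h in kept)
    if norm == 0:
        return kept, [0.0 for _ in visual_idx]
    agg = [sum(responses[h] / norm * head_attn[h][i] for h in kept)
           for i in visual_idx]
    return kept, agg


# Toy example: positions 0 and 1 are visual tokens, 2 and 3 are text.
visual_idx = [0, 1]
logits = [0.0, 0.0, 0.0, 0.0]
mass_before = sum(softmax(logits)[i] for i in visual_idx)
boosted = softmax(boost_visual_logits(logits, visual_idx, alpha=2.0))
mass_after = sum(boosted[i] for i in visual_idx)

heads = [
    [0.40, 0.40, 0.10, 0.10],  # strongly visual head
    [0.05, 0.05, 0.45, 0.45],  # mostly text-focused head
    [0.30, 0.20, 0.25, 0.25],  # moderately visual head
]
kept, agg = aggregate_heads(heads, visual_idx, keep=2)
print(mass_before, mass_after, kept, agg)
```

In the toy run the additive bias raises the visual attention mass from one half to two thirds, and the text-focused head is dropped before aggregation. Both stages operate only on attention values at inference time, which is consistent with the tuning-free claim.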

Cite This Study

Li et al. studied this question.

synapsesocial.com/papers/69ec5b8a88ba6daa22dad0b3 https://doi.org/10.3390/app16094143
