What question did this study set out to answer?

The aim is to detect and reduce hallucinations in multimodal large language models by improving visual attention tracing.

June 6, 2026 | Open Access

Mitigating multimodal hallucinations through visual attention tracing and origin-point regeneration

Key Points

The aim is to detect and reduce hallucinations in multimodal large language models by improving visual attention tracing.
Introduced a training-free decoding framework called hallucination backtracking (HB) to monitor visual attention dynamics during text generation.
Developed a visual attention score (VAS) to quantify the drift in the model's focus from image features to prior text generations.
Evaluated performance across four architectures: LLaVA-1.5, InstructBLIP, MiniGPT-4, and Shikra.
Localized hallucination sources with a 41.8% exact-match rate and 84.1% before-first accuracy.
On LLaVA-1.5, raised the F1 score on the POPE benchmark to 91.4% and lowered the CHAIR_S metric to 40.2%, improvements of 1.5 and 4.4 points over OPERA, respectively.
Observed a residual false negative rate of 15.9%, indicating that inference-driven hallucinations remain an open challenge.

Abstract

Multimodal large language models (MLLMs) exhibit impressive prowess in vision-language understanding, but their utility is often compromised by hallucinations, instances where generated narratives diverge significantly from visual evidence. Current remedial strategies largely struggle to pinpoint the genesis of these errors, relying either on resource-intensive retraining or indiscriminate global penalties that fail to address the specific locus of the discrepancy. Addressing this limitation, we introduce hallucination backtracking (HB), a training-free decoding framework designed to effectively detect and mitigate errors by monitoring visual attention dynamics during generation. This approach is grounded in the observation that hallucinations are not random; rather, they stem from specific pivotal tokens where the model’s focus precipitously shifts from image features to its own prior textual generations. By quantifying this drift through a novel visual attention score (VAS), our origin-point detection mechanism successfully isolates the source of errors, achieving a 41.8% exact match and 84.1% before-first accuracy in localization. Once an attentional anomaly is detected, the system autonomously backtracks to the divergence point, triggering a regeneration process reinforced by stricter visual grounding constraints. Rigorous evaluations across diverse architectures (LLaVA-1.5, InstructBLIP, MiniGPT-4, and Shikra) confirm that HB consistently surpasses state-of-the-art baselines; notably, on LLaVA-1.5, our method elevates the F1 score on the POPE benchmark to 91.4% while reducing the CHAIR_S metric to 40.2%, yielding improvements of 1.5 and 4.4 points over OPERA, respectively, though a residual false negative rate of 15.9% indicates that inference-driven hallucinations remain an open challenge. Beyond quantitative gains, we provide a granular dissection of the hallucination phenomenon through analyses of VAS trajectory patterns and failure modes, ultimately advocating for precise localization and targeted correction as a promising paradigm for reliable multimodal generation.
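
The detect-then-backtrack loop described in the abstract can be illustrated with a toy sketch. The abstract does not give the VAS formula or the detection rule, so everything below is an assumption made for illustration: the score is taken to be the share of a generated token's attention that falls on image tokens, an origin point is flagged when that share drops below a fixed fraction of its running mean, and the names `visual_attention_score`, `find_origin_point`, `backtrack`, `drop_ratio`, and `warmup` are hypothetical rather than taken from the paper.

```python
from typing import List, Optional, Sequence


def visual_attention_score(attn_row: Sequence[float],
                           image_positions: Sequence[int]) -> float:
    """Assumed VAS: fraction of one token's attention mass on image tokens."""
    total = sum(attn_row)
    if total <= 0:
        return 0.0
    return sum(attn_row[i] for i in image_positions) / total


def find_origin_point(vas: Sequence[float],
                      drop_ratio: float = 0.5,
                      warmup: int = 2) -> Optional[int]:
    """Return the first step whose VAS falls below drop_ratio times the
    mean VAS of all earlier steps, or None if no such drop occurs.
    (Hypothetical rule; the paper's actual criterion may differ.)"""
    for t in range(warmup, len(vas)):
        baseline = sum(vas[:t]) / t
        if vas[t] < drop_ratio * baseline:
            return t
    return None


def backtrack(tokens: List[str], origin: Optional[int]) -> List[str]:
    """Keep only the tokens generated before the origin point; a real
    system would resume decoding from here with stronger visual grounding."""
    return list(tokens) if origin is None else list(tokens[:origin])


# Toy data: context slots 0-2 are image tokens, the rest are text tokens.
# Each row is the attention of one generated token over the context so far.
image_positions = [0, 1, 2]
attention_rows = [
    [0.30, 0.30, 0.20, 0.20],
    [0.30, 0.20, 0.20, 0.20, 0.10],
    [0.25, 0.25, 0.20, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.20, 0.20, 0.20, 0.20],  # focus shifts to text
]
tokens = ["a", "dog", "on", "skateboard"]

vas = [visual_attention_score(row, image_positions) for row in attention_rows]
origin = find_origin_point(vas)
kept = backtrack(tokens, origin)
print(origin, kept)
```

In this toy trace the fourth token places most of its attention on earlier text, so it is flagged as the origin point and generation would restart after the third token. How HB enforces the "stricter visual grounding constraints" during regeneration is not specified in the abstract and is left out of the sketch.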
