Abstract
Multimodal large language models (MLLMs) perform strongly on vision-language understanding, but their utility is often compromised by hallucinations: instances where the generated text diverges significantly from the visual evidence. Existing remedies struggle to pinpoint where these errors originate, relying either on resource-intensive retraining or on indiscriminate global penalties that do not target the locus of the discrepancy. To address this limitation, we introduce hallucination backtracking (HB), a training-free decoding framework that detects and mitigates errors by monitoring visual attention dynamics during generation. The approach is grounded in the observation that hallucinations are not random; rather, they stem from specific pivotal tokens at which the model's focus shifts abruptly from image features to its own previously generated text. By quantifying this drift with a novel visual attention score (VAS), our origin-point detection mechanism isolates the source of errors, achieving 41.8% exact-match and 84.1% before-first accuracy in localization. Once an attentional anomaly is detected, the system backtracks to the divergence point and regenerates from there under stricter visual grounding constraints. Evaluations across diverse architectures (LLaVA-1.5, InstructBLIP, MiniGPT-4, and Shikra) show that HB consistently surpasses state-of-the-art baselines. Notably, on LLaVA-1.5, our method raises the F1 score on the POPE benchmark to 91.4% and reduces the CHAIR_S metric to 40.2%, improvements of 1.5 and 4.4 points over OPERA, respectively. A residual false negative rate of 15.9% indicates that inference-driven hallucinations remain an open challenge.
Beyond quantitative gains, we provide a granular dissection of the hallucination phenomenon through analyses of VAS trajectory patterns and failure modes, ultimately advocating for precise localization and targeted correction as a promising paradigm for reliable multimodal generation.
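To make the detect-then-backtrack idea concrete, the following is a minimal sketch under stated assumptions, not the method as implemented in the paper. The abstract does not define VAS, so the sketch assumes it is the share of a generated token's attention mass that falls on image tokens, and it assumes the origin point is the first step at which that share drops by more than a fixed threshold. The function names, the threshold value, and the toy attention rows are all hypothetical; regeneration under stricter visual grounding is not shown.

```python
from typing import List, Optional, Sequence


def visual_attention_score(attn_row: Sequence[float],
                           image_positions: Sequence[int]) -> float:
    """Assumed VAS: fraction of one token's attention mass on image tokens."""
    total = sum(attn_row)
    if total <= 0.0:
        return 0.0
    return sum(attn_row[i] for i in image_positions) / total


def find_origin_point(vas: Sequence[float],
                      drop_threshold: float = 0.3) -> Optional[int]:
    """Return the first step whose VAS falls by more than drop_threshold
    relative to the previous step, or None if no such drop occurs.
    The threshold is a hypothetical value chosen for illustration."""
    for t in range(1, len(vas)):
        if vas[t - 1] - vas[t] > drop_threshold:
            return t
    return None


def backtrack(tokens: List[str], origin: Optional[int]) -> List[str]:
    """Keep only the tokens generated before the detected origin point."""
    return list(tokens) if origin is None else list(tokens[:origin])


# Toy data: context positions 0 and 1 are image tokens, 2 and 3 are text.
# Each row is the (already aggregated) attention of one generated token.
image_positions = [0, 1]
attention = [
    [0.40, 0.30, 0.20, 0.10],
    [0.35, 0.30, 0.20, 0.15],
    [0.10, 0.10, 0.40, 0.40],  # attention shifts away from the image here
    [0.05, 0.10, 0.45, 0.40],
]
tokens = ["a", "dog", "holding", "frisbee"]

vas = [visual_attention_score(row, image_positions) for row in attention]
origin = find_origin_point(vas)
prefix = backtrack(tokens, origin)
```

In this toy run the sharp drop occurs at the third token, so decoding would resume from the two-token prefix. A real decoder would also need to choose which layers and heads to aggregate over, a choice this sketch leaves open.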
Li et al. (Thu,) studied this question.