Multimodal large language models (MLLMs) have made significant advances in multimodal understanding, reasoning, and interaction. However, they still suffer from hallucination, in which the generated text deviates from the factual content of the input image. To mitigate this issue, prior studies have primarily employed direct preference optimization (DPO) for human preference alignment. These approaches, however, treat all words equally, neglecting the varying significance of individual words in grounding text generation in image content. This limitation hinders fine-grained semantic alignment and consequently constrains their effectiveness in suppressing hallucination. To address it, we propose a vision-guided lexical direct preference optimization method, called VGL-DPO. Specifically, we quantify the significance of each word in the positive preference data based on its relevance to the visual input and dynamically assign different weights to different words during training. This enables more precise optimization by emphasizing the critical words that contribute to factual grounding. Additionally, we use the difference in importance between high-significance words in the positive and negative preference data to adaptively adjust the weight of the negative preference loss. This dynamic reweighting mechanism further refines the model’s ability to suppress hallucinated content while reinforcing factual accuracy. Extensive experiments across various models demonstrate that our method outperforms existing state-of-the-art methods in reducing hallucination and improving factual accuracy.
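The abstract gives no formulas, so the following is only a minimal sketch of how a word-weighted DPO objective of this general kind might look, not the authors' implementation. It assumes per-word log-probabilities are already available from the policy and a frozen reference model. The function name `weighted_dpo_loss`, the `word_weights` argument (standing in for visual-relevance scores), and the scalar `neg_scale` (standing in for the adaptive weight on the negative preference term) are all hypothetical. With every weight set to 1 and `neg_scale` set to 1, it reduces to the standard DPO loss.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_dpo_loss(pol_w, ref_w, pol_l, ref_l, word_weights,
                      neg_scale=1.0, beta=0.1):
    """Sketch of a word-weighted DPO loss for one preference pair.

    pol_w, ref_w: per-word log-probs of the preferred response under the
                  policy and the reference model.
    pol_l, ref_l: the same for the rejected response.
    word_weights: hypothetical per-word importance scores for the
                  preferred response (assumed, not from the paper).
    neg_scale:    hypothetical scalar weight on the rejected-response term.
    """
    if not (len(pol_w) == len(ref_w) == len(word_weights)):
        raise ValueError("preferred-response inputs must have equal length")
    if len(pol_l) != len(ref_l):
        raise ValueError("rejected-response inputs must have equal length")

    # Weighted sum of per-word log-ratios for the preferred response.
    chosen = sum(w * (p - r) for w, p, r in zip(word_weights, pol_w, ref_w))
    # Scaled sum of per-word log-ratios for the rejected response.
    rejected = neg_scale * sum(p - r for p, r in zip(pol_l, ref_l))

    margin = beta * (chosen - rejected)
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

In this sketch, raising the weight of a word on which the policy already assigns higher probability than the reference increases the margin and lowers the loss, which is one way to read the abstract's idea of emphasizing visually grounded words.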
Siyuan Li
F H Wang
Simeng Qin
ACM Transactions on Multimedia Computing Communications and Applications
Tianjin University
Northeastern University
Alibaba Group (China)
Li et al. studied this question.
www.synapsesocial.com/papers/6a0aace55ba8ef6d83b705e7 — DOI: https://doi.org/10.1145/3796715