What question did this study set out to answer?

The study aims to reduce hallucination and improve factual accuracy in multimodal large language models through a word-weighted variant of direct preference optimization, called VGL-DPO.

May 18, 2026

VGL-DPO: Vision-Guided Lexical Direct Preference Optimization for Mitigating Hallucination in Multimodal Large Language Models

Key Points

Proposes VGL-DPO, which scores each word in the positive (preferred) response by its relevance to the visual input and weights it accordingly during training.
Adaptively reweights the negative preference loss using the importance gap between high-significance words in the positive and negative responses.
Evaluates the method on several multimodal large language models to measure hallucination suppression.
Reports that VGL-DPO outperforms existing state-of-the-art methods at reducing hallucination and improving factual accuracy.
Emphasizes the words that contribute most to factual grounding, which allows more precise optimization.
Shows that these gains hold across multiple multimodal large language models.

Abstract

Multimodal large language models (MLLMs) have achieved significant advancements in multimodal understanding, reasoning, and interaction. However, they still suffer from hallucination, where the generated text often deviates from the factual content of the input image. To mitigate this issue, prior studies have primarily employed direct preference optimization (DPO) for human preference alignment. However, these approaches treat all textual words equally, neglecting the varying significance of individual words in grounding text generation to image content. This limitation hinders fine-grained semantic alignment and consequently constrains their effectiveness in hallucination suppression. To address this limitation, we propose a vision-guided lexical direct preference optimization method, called VGL-DPO. Specifically, we quantify the significance of words in positive preference data based on their relevance to the visual input and dynamically assign different weights to different words during training. This facilitates more precise optimization by emphasizing critical words that contribute to factual grounding. Additionally, we leverage the importance differences between high-significance words in positive and negative preference data to adaptively adjust the weight of the negative preference loss. This dynamic reweighting mechanism further refines the model’s ability to suppress hallucinated content while reinforcing factual accuracy. Extensive experiments across various models demonstrate that our method outperforms existing state-of-the-art methods in reducing hallucination and enhancing factual accuracy.
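The abstract does not give the exact loss, so the following is only a minimal sketch of how a word-weighted DPO objective of this kind might look. It starts from the standard DPO loss, -log σ(β · (chosen log-ratio − rejected log-ratio)), and adds two assumed ingredients: per-word weights on the log-ratios and a scale on the rejected term. The function names, the linear weight mapping (`relevance_to_weights`), and the top-k gap rule (`negative_loss_scale`) are hypothetical illustrations, not the authors' formulation. How the visual relevance scores themselves are computed is not covered here.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def relevance_to_weights(relevance, alpha=1.0):
    """Hypothetical mapping from visual-relevance scores in [0, 1] to
    per-word weights; a score of 0 leaves the word at weight 1."""
    return [1.0 + alpha * r for r in relevance]


def top_k_mean(scores, k):
    """Mean of the k largest scores (assumes a non-empty list)."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)


def negative_loss_scale(chosen_rel, rejected_rel, k=2):
    """Hypothetical scale for the rejected-response term, driven by the gap
    between the most relevant words of the two responses (assumed rule)."""
    gap = top_k_mean(chosen_rel, k) - top_k_mean(rejected_rel, k)
    return max(0.0, 1.0 + gap)


def weighted_log_ratio(policy_logps, ref_logps, weights):
    """Sum over words of w_t * (log pi_policy(y_t) - log pi_ref(y_t))."""
    return sum(w * (p - r) for p, r, w in zip(policy_logps, ref_logps, weights))


def word_weighted_dpo_loss(chosen, rejected, beta=0.1, neg_scale=1.0):
    """chosen / rejected are (policy_logps, ref_logps, weights) tuples.
    With all weights 1 and neg_scale 1 this is the standard DPO loss."""
    pos = weighted_log_ratio(*chosen)
    neg = weighted_log_ratio(*rejected)
    return -log_sigmoid(beta * (pos - neg_scale * neg))


# Toy per-word log-probabilities (made-up numbers, two words per response).
chosen_policy, chosen_ref = [-1.0, -2.0], [-1.5, -2.5]
rejected_policy, rejected_ref = [-2.0, -1.0], [-1.0, -1.0]
uniform = [1.0, 1.0]

baseline = word_weighted_dpo_loss(
    (chosen_policy, chosen_ref, uniform),
    (rejected_policy, rejected_ref, uniform),
)
weighted = word_weighted_dpo_loss(
    (chosen_policy, chosen_ref, relevance_to_weights([1.0, 0.0])),
    (rejected_policy, rejected_ref, uniform),
    neg_scale=negative_loss_scale([1.0, 0.0], [0.2, 0.1]),
)
print(baseline, weighted)
```

In a real training setup the per-word log-probabilities would come from the policy and a frozen reference model, and the loss would be computed on tensors so that gradients flow through the policy terms; the scalar version above only illustrates where the weights enter.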

Cite This Study

Li et al. (Sat,) studied this question.

synapsesocial.com/papers/6a0aace55ba8ef6d83b705e7 https://doi.org/10.1145/3796715
