What question did this study set out to answer?

This research focuses on bridging the granularity mismatch in retrieval-augmented generation systems for improved visual question answering.

June 18, 2026 · Open Access

Fine grained reranking via caption bridging for knowledge augmented visual question answering

Key Points

Proposed Fine-Grained Retrieval-Augmented Generation (FG-RAG) framework
Enhanced CLIP architecture with patch-level contrastive supervision
Joint retrieval-reranking optimization mechanism using a large language model
Achieved Recall@1 of 0.8845 on MSCOCO, outperforming prior state-of-the-art methods by up to 7.5% across datasets
Obtained an F1 score of 0.4353 on MSCOCO for visual question answering
Reduced hallucination rates by 15 percentage points compared to conventional RAG systems
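
For readers unfamiliar with the retrieval metric quoted above, the following is a minimal sketch of how Recall@K is commonly computed. It is not taken from the paper: the function name `recall_at_k`, the toy rankings, and the simplification of one gold item per query are all assumptions made for illustration (MSCOCO pairs each image with several captions, so a real evaluation would typically accept any of them as a hit).

```python
from typing import Sequence


def recall_at_k(
    ranked_ids: Sequence[Sequence[int]],
    gold_ids: Sequence[int],
    k: int = 1,
) -> float:
    """Fraction of queries whose gold item appears among the top-k results.

    Hypothetical helper: assumes exactly one gold item per query.
    """
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("ranked_ids and gold_ids must have the same length")
    if not gold_ids:
        return 0.0
    hits = sum(
        1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k]
    )
    return hits / len(gold_ids)


# Toy data (invented for illustration): four queries, each with a ranked
# list of candidate ids and a single correct id.
rankings = [[3, 1, 2], [0, 2, 1], [2, 0, 1], [1, 0, 2]]
gold = [3, 2, 2, 0]

print(recall_at_k(rankings, gold, k=1))  # 0.5
print(recall_at_k(rankings, gold, k=2))  # 1.0
```

Under this definition, a Recall@1 of 0.8845 would mean that the correct item is ranked first for roughly 88% of queries.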

Abstract

Retrieval-Augmented Generation (RAG) has emerged as a pivotal framework for knowledge-intensive reasoning by coupling external retrieval with generative capabilities. However, existing RAG systems suffer from a critical granularity mismatch problem: coarse-grained retrieval units (entire passages or global images) fail to align with the fine-grained reasoning requirements of generation tasks, particularly in multimodal contexts where localized visual evidence is essential. We propose Fine-Grained Retrieval-Augmented Generation (FG-RAG), a unified framework that bridges this semantic resolution gap through two key innovations. First, we enhance the CLIP architecture with patch-level contrastive supervision, enabling explicit alignment between localized image regions and corresponding textual fragments. Second, we introduce a joint retrieval-reranking optimization mechanism that unifies a dense retriever with a large language model (LLM)-based reranker through a shared relevance loss. To address the non-differentiable nature of LLM generation, we employ a Score Alignment Strategy in which generative likelihoods provide structural supervision for the retriever, creating bidirectional feedback between retrieval precision and generation quality. Comprehensive evaluation on the MSCOCO and Flickr30k benchmarks demonstrates that FG-RAG achieves significant retrieval gains (Recall@1 = 0.8845 on MSCOCO), outperforming state-of-the-art methods by up to 7.5% across datasets. In visual question answering, our framework achieves an F1 score of 0.4353 on MSCOCO and reduces hallucination rates by 15 percentage points compared to conventional RAG systems. Ablation studies confirm the necessity of both the fine-grained modeling and joint optimization components, with their removal causing substantial performance degradation in critical metrics. These results establish that fine-grained semantic alignment coupled with closed-loop optimization substantially enhances the factual grounding and contextual coherence of multimodal generation systems.
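
The abstract does not spell out the form of the Score Alignment Strategy. One common way to let generative likelihoods supervise a retriever is to treat the LLM's per-candidate log-likelihoods as a fixed target distribution and penalize the retriever's score distribution for diverging from it. The sketch below illustrates that idea only; the KL-divergence objective, the `temperature` parameter, and the names `softmax` and `score_alignment_loss` are assumptions, not the authors' formulation.

```python
import math
from typing import List


def softmax(scores: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def score_alignment_loss(
    retriever_scores: List[float],
    llm_log_likelihoods: List[float],
    temperature: float = 1.0,
) -> float:
    """KL(teacher || student) over one query's candidate list.

    Assumed formulation: the teacher distribution comes from LLM
    log-likelihoods and is treated as a constant target, so no gradient
    needs to flow through the (non-differentiable) generation step.
    """
    teacher = softmax(llm_log_likelihoods, temperature)
    student = softmax(retriever_scores, temperature)
    return sum(
        p * math.log(p / q) for p, q in zip(teacher, student) if p > 0.0
    )


# Toy values (invented): the retriever prefers candidate 1, while the LLM
# finds candidate 0 most useful for generating the answer.
retriever_scores = [0.2, 0.9, 0.1]
llm_log_likelihoods = [-1.0, -3.5, -2.0]

loss = score_alignment_loss(retriever_scores, llm_log_likelihoods)
print(f"alignment loss: {loss:.4f}")
```

In a trainable setting the same quantity would be computed with an automatic-differentiation library so that gradients reach the retriever's parameters; the loss is zero when the two score lists induce identical distributions and grows as the retriever's ranking departs from the LLM's.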
