Retrieval-Augmented Generation (RAG) has emerged as a pivotal framework for knowledge-intensive reasoning by coupling external retrieval with generative capabilities. However, existing RAG systems suffer from a critical granularity mismatch: coarse-grained retrieval units (entire passages or global images) fail to align with the fine-grained reasoning requirements of generation tasks, particularly in multimodal contexts where localized visual evidence is essential. We propose Fine-Grained Retrieval-Augmented Generation (FG-RAG), a unified framework that bridges this semantic resolution gap through two key innovations. First, we enhance the CLIP architecture with patch-level contrastive supervision, enabling explicit alignment between localized image regions and the corresponding textual fragments. Second, we introduce a joint retrieval-reranking optimization mechanism that unifies a dense retriever with a large language model (LLM)-based reranker through a shared relevance loss. Because LLM generation is non-differentiable, we employ a Score Alignment Strategy in which generative likelihoods provide structural supervision for the retriever, creating bidirectional feedback between retrieval precision and generation quality. Comprehensive evaluation on the MSCOCO and Flickr30k benchmarks demonstrates that FG-RAG achieves significant retrieval gains (Recall@1 = 0.8845 on MSCOCO), outperforming state-of-the-art methods by up to 7.5% across datasets. In visual question answering, our framework achieves an F1 score of 0.4353 on MSCOCO and reduces hallucination rates by 15 percentage points compared to conventional RAG systems. Ablation studies confirm the necessity of both the fine-grained modeling and the joint optimization components: removing either one causes substantial degradation in critical metrics.
These results establish that fine-grained semantic alignment coupled with closed-loop optimization substantially enhances the factual grounding and contextual coherence of multimodal generation systems.
Feng et al. (Mon,) studied this question.
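The Score Alignment Strategy is described above only at a high level, so the following is a minimal sketch of one common way such supervision can be set up, not the formulation used in the paper. It assumes that the retriever assigns a scalar score to each candidate, that the LLM provides a log-likelihood of the target answer conditioned on each candidate, and that the two are aligned with a KL divergence between their softmax distributions. The function names, the temperature parameter, and the numeric values are all hypothetical.

```python
import math


def softmax(xs, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in xs]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def score_alignment_loss(retriever_scores, llm_log_likelihoods, temperature=1.0):
    """KL(q || p) between an LLM-derived target q and the retriever distribution p.

    Assumption: q is treated as a fixed target, so no gradient would need to
    flow through the LLM; only the retriever scores would be trained.
    """
    p = softmax(retriever_scores, temperature)
    q = softmax(llm_log_likelihoods, temperature)
    return sum(
        qi * (math.log(qi) - math.log(pi))
        for qi, pi in zip(q, p)
        if qi > 0.0
    )


# Hypothetical values for three retrieved candidates.
retriever_scores = [2.0, 1.0, 0.5]        # retriever prefers candidate 0
llm_log_likelihoods = [-4.0, -1.0, -6.0]  # LLM finds candidate 1 most useful

loss = score_alignment_loss(retriever_scores, llm_log_likelihoods)
aligned = score_alignment_loss(llm_log_likelihoods, llm_log_likelihoods)
```

In this sketch the loss is positive when the retriever ranks candidates differently from the LLM and drops to zero when the two distributions coincide, which is the sense in which generation quality can feed back into retrieval without differentiating through the generator.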