Visual document question answering presents a fundamental challenge for retrieval-augmented generation systems: documents encode information through complex interactions of text, layout, tables, and visual elements that text-only pipelines cannot effectively process. Recent vision-based retrieval methods demonstrate strong performance in identifying relevant document pages, yet downstream answer generation remains problematic. We formalize this phenomenon as the retrieval-generation gap, defined as the discrepancy between retrieval recall and answer accuracy. Through systematic evaluation of three dominant multimodal RAG paradigms on the DocVQA benchmark, we quantify this gap and identify its root causes. We then propose ColPali-Fusion++, a hybrid architecture that integrates late-interaction visual retrieval with OCR-based text extraction and adaptive context assembly. Our architecture achieves 71.3% accuracy on DocVQA, a 24.6-percentage-point improvement over the ColPali baseline, while reducing the hallucination rate from 62.4% to 40.9%. Ablation studies confirm that each architectural component contributes meaningfully to the performance gains. These results demonstrate that bridging the retrieval-generation gap requires combining the semantic matching capabilities of visual retrievers with the precise text extraction afforded by modern OCR systems.
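To make the gap definition concrete, the following is a minimal sketch rather than the evaluation code used in this work. It assumes the gap is a simple difference, in percentage points, between retrieval recall and answer accuracy over the same question set; the abstract says only "discrepancy", so this reading is an assumption. The `QuestionResult` record, the `retrieval_generation_gap` helper, and the toy records are hypothetical and exist only for illustration. The last two lines redo the arithmetic implied by the reported figures.

```python
from dataclasses import dataclass


@dataclass
class QuestionResult:
    """Hypothetical per-question record (not a structure from the paper)."""
    gold_page_retrieved: bool  # gold page appeared among the retrieved pages
    answer_correct: bool       # generated answer matched the reference answer


def retrieval_generation_gap(results):
    """Return (recall, accuracy, gap), all in percentage points.

    Assumption: gap = retrieval recall - answer accuracy.
    """
    if not results:
        raise ValueError("results must be non-empty")
    n = len(results)
    recall = 100.0 * sum(r.gold_page_retrieved for r in results) / n
    accuracy = 100.0 * sum(r.answer_correct for r in results) / n
    return recall, accuracy, recall - accuracy


# Toy records, invented purely for illustration (not DocVQA results).
toy_results = [
    QuestionResult(gold_page_retrieved=True, answer_correct=True),
    QuestionResult(gold_page_retrieved=True, answer_correct=False),
    QuestionResult(gold_page_retrieved=True, answer_correct=False),
    QuestionResult(gold_page_retrieved=False, answer_correct=False),
]
recall, accuracy, gap = retrieval_generation_gap(toy_results)

# Arithmetic implied by the figures reported in the abstract.
baseline_accuracy = round(71.3 - 24.6, 1)        # implied ColPali baseline accuracy
hallucination_reduction = round(62.4 - 40.9, 1)  # drop in percentage points
```

On the toy records, three of four gold pages are retrieved but only one answer is correct, so the sketch reports a recall of 75.0, an accuracy of 25.0, and a gap of 50.0 points. The reported figures imply a baseline accuracy of 46.7% and a hallucination reduction of 21.5 percentage points.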