What question did this study set out to answer?

The study asks why visual document question answering systems that retrieve the right pages still generate wrong answers, a discrepancy the authors call the retrieval-generation gap, and how that gap can be closed.

June 6, 2026 | Open Access

Bridging the Retrieval-Generation Gap: A Hybrid OCR-Vision Architecture for Visual Document Question Answering

Key Points

Systematic evaluation of three multimodal retrieval-augmented generation paradigms on the DocVQA benchmark.
Development of the ColPali-Fusion++ architecture combining late-interaction visual retrieval and OCR text extraction.
Ablation studies to assess each component's contribution to overall performance.
Achieved 71.3% accuracy on the DocVQA benchmark, a 24.6 percentage point increase over the ColPali baseline.
Reduced hallucination rates from 62.4% to 40.9%.
Identified key factors causing the retrieval-generation gap in existing systems.

Abstract

Visual document question answering presents a fundamental challenge for retrieval-augmented generation systems: documents encode information through complex interactions of text, layout, tables, and visual elements that text-only pipelines cannot effectively process. Recent vision-based retrieval methods demonstrate strong performance in identifying relevant document pages, yet downstream answer generation remains problematic. We formalize this phenomenon as the retrieval-generation gap, defined as the discrepancy between retrieval recall and answer accuracy. Through systematic evaluation of three dominant multimodal RAG paradigms on the DocVQA benchmark, we quantify this gap and identify its root causes. We then propose ColPali-Fusion++, a hybrid architecture that integrates late-interaction visual retrieval with OCR-based text extraction and adaptive context assembly. Our architecture achieves 71.3% accuracy on DocVQA, representing a 24.6 percentage point improvement over the ColPali baseline, while reducing hallucination rates from 62.4% to 40.9%. Ablation studies confirm that each architectural component contributes meaningfully to performance gains. These results demonstrate that bridging the retrieval-generation gap requires combining the semantic matching capabilities of visual retrievers with the precise text extraction afforded by modern OCR systems.
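The abstract defines the retrieval-generation gap as the discrepancy between retrieval recall and answer accuracy but does not give the exact formula on this page. The sketch below assumes the simplest reading, recall minus accuracy over the same set of questions. The `QuestionResult` record, the `retrieval_generation_gap` function, and the toy data are hypothetical and not taken from the paper; only the figures in the final lines come from the abstract.

```python
from dataclasses import dataclass


@dataclass
class QuestionResult:
    """Hypothetical per-question record; not the paper's data format."""
    retrieved_relevant: bool  # was the gold page among the retrieved pages?
    answer_correct: bool      # did the generated answer match the reference?


def retrieval_generation_gap(results):
    """Return (recall, accuracy, gap), assuming gap = recall - accuracy."""
    if not results:
        raise ValueError("need at least one result")
    n = len(results)
    recall = sum(r.retrieved_relevant for r in results) / n
    accuracy = sum(r.answer_correct for r in results) / n
    return recall, accuracy, recall - accuracy


# Toy data (invented for illustration): the right page is retrieved for
# three of four questions, but only one answer is correct.
toy = [
    QuestionResult(True, True),
    QuestionResult(True, False),
    QuestionResult(True, False),
    QuestionResult(False, False),
]
recall, accuracy, gap = retrieval_generation_gap(toy)

# Figures derived from the numbers reported in the abstract.
implied_baseline_accuracy = round(71.3 - 24.6, 1)  # ColPali baseline, in %
hallucination_drop_pp = round(62.4 - 40.9, 1)      # percentage points
```

On the toy data, recall is 0.75 and accuracy is 0.25, so the gap is 0.5. Applying the same arithmetic to the reported results implies a ColPali baseline accuracy of 46.7% and a hallucination reduction of 21.5 percentage points.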

Cite This Study

EXERGY (Wed,) studied this question.

synapsesocial.com/papers/6a23bb2071a5da9775e76b3e https://doi.org/10.5281/zenodo.20549254