Abstract

Multimodal industrial documents, such as operation manuals, circuit diagrams, and parameter tables, contain domain knowledge distributed across text, images, and document layout. However, most existing retrieval-augmented generation (RAG) frameworks rely on static retrieval and fusion policies with fixed modality weights and uniform retrieval depth, making them less adaptable to diverse query intents and dynamic cross-modal dependencies. As a result, they often retrieve incomplete evidence and yield suboptimal reasoning in complex long-document scenarios. To address these challenges, we propose MARL-RAGDoc, a hierarchical multi-agent reinforcement learning framework for multimodal retrieval-augmented reasoning. A high-level coordinator agent dynamically allocates modality weights and retrieval depth based on query characteristics, while specialized text, image, and table agents perform fine-grained evidence selection within their respective candidate pools. A collaborative reasoning module integrates the retrieved evidence and provides hierarchical reward signals to continuously optimize retrieval policies. Experimental results on multiple multimodal document benchmarks demonstrate that MARL-RAGDoc consistently outperforms baselines in both retrieval accuracy and reasoning performance, while remaining computationally efficient. Our code and dataset are publicly available at https://github.com/Yihong-Q/MARL-RAGDoc.
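To make the two-level structure described above concrete, the following is a minimal sketch of a single retrieval pass, not the released MARL-RAGDoc implementation. It assumes the coordinator emits one raw score per modality, that a softmax turns these scores into modality weights, and that a shared retrieval budget is split in proportion to those weights. The names `softmax`, `allocate_depth`, `select_evidence`, and `retrieve`, as well as the proportional budget split, are hypothetical choices made for illustration. The learned policies and the hierarchical reward signals are omitted.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical candidate format: (evidence_id, relevance_score).
Candidate = Tuple[str, float]


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Turn raw coordinator scores into modality weights that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {m: math.exp(s - peak) for m, s in scores.items()}
    total = sum(exps.values())
    return {m: e / total for m, e in exps.items()}


def allocate_depth(weights: Dict[str, float], total_depth: int) -> Dict[str, int]:
    """Split a retrieval budget across modalities in proportion to the weights.

    Assumption: the largest-remainder method is used so the per-modality
    depths are integers that add up to total_depth.
    """
    raw = {m: w * total_depth for m, w in weights.items()}
    depth = {m: math.floor(r) for m, r in raw.items()}
    leftover = total_depth - sum(depth.values())
    # Give the remaining slots to the modalities with the largest fractions.
    by_fraction = sorted(raw, key=lambda m: raw[m] - depth[m], reverse=True)
    for m in by_fraction[:leftover]:
        depth[m] += 1
    return depth


def select_evidence(pool: List[Candidate], k: int) -> List[str]:
    """Modality agent stand-in: keep the k highest-scoring candidates."""
    ranked = sorted(pool, key=lambda item: item[1], reverse=True)
    return [evidence_id for evidence_id, _ in ranked[:k]]


def retrieve(
    coordinator_scores: Dict[str, float],
    pools: Dict[str, List[Candidate]],
    total_depth: int,
) -> Dict[str, List[str]]:
    """One pass: the coordinator sets weights and depth, then agents select."""
    weights = softmax(coordinator_scores)
    depth = allocate_depth(weights, total_depth)
    return {m: select_evidence(pools.get(m, []), depth[m]) for m in weights}
```

In this sketch a query that the coordinator scores as text-heavy receives most of the retrieval budget for the text pool, while the image and table pools still contribute a small number of candidates. In a learned system the fixed scores and the top-k rule would be replaced by trained policies.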
Qian et al. studied this question.