What question did this study set out to answer?

The study aims to improve retrieval and reasoning over multimodal industrial documents using a hierarchical multi-agent approach.

March 16, 2026 | Open Access

Hierarchical multi-agent reinforcement learning for retrieval-augmented industrial document question answering

Key Points

The study aims to improve retrieval and reasoning over multimodal industrial documents using a hierarchical multi-agent approach.
The authors developed MARL-RAGDoc, a hierarchical multi-agent reinforcement learning framework.
A high-level coordinator agent dynamically allocates modality weights and retrieval depth based on query characteristics.
Specialized text, image, and table agents select evidence within their respective candidate pools.
A collaborative reasoning module integrates the retrieved evidence and provides hierarchical reward signals to optimize the retrieval policies.
MARL-RAGDoc consistently outperformed baseline models in retrieval accuracy.
It also improved reasoning performance in complex long-document scenarios.
The framework remained computationally efficient in the evaluations.
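
The coordinator and modality agents described in the points above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `allocate_budget` and `select_evidence`, the softmax weighting, the largest-remainder rounding, and the hand-set scores are hypothetical and are not taken from the MARL-RAGDoc code. In the actual framework the weights and depths would come from a learned policy rather than fixed numbers.

```python
import math


def allocate_budget(scores, total_k):
    """Hypothetical coordinator step: turn per-modality scores into
    softmax weights and integer retrieval depths that sum to total_k."""
    top = max(scores.values())
    exps = {m: math.exp(s - top) for m, s in scores.items()}  # stable softmax
    norm = sum(exps.values())
    weights = {m: e / norm for m, e in exps.items()}

    # Split the retrieval budget in proportion to the weights, then hand
    # the leftover slots to the largest fractional remainders.
    raw = {m: w * total_k for m, w in weights.items()}
    depths = {m: int(math.floor(r)) for m, r in raw.items()}
    leftover = total_k - sum(depths.values())
    by_remainder = sorted(raw, key=lambda m: raw[m] - depths[m], reverse=True)
    for m in by_remainder[:leftover]:
        depths[m] += 1
    return weights, depths


def select_evidence(candidates, depth):
    """Hypothetical modality agent: keep the `depth` highest-scoring
    (candidate_id, score) pairs from its own candidate pool."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return [cid for cid, _ in ranked[:depth]]


# Illustrative, hand-set scores for a query that leans on text.
weights, depths = allocate_budget({"text": 2.0, "image": 0.5, "table": 1.0}, 10)
text_hits = select_evidence([("t1", 0.9), ("t2", 0.4), ("t3", 0.7)], 2)
```

The point of the sketch is only the division of labor: one component decides how much to retrieve per modality, and each agent decides what to keep within its own pool.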

Abstract

Multimodal industrial documents, such as operation manuals, circuit diagrams, and parameter tables, contain domain knowledge distributed across text, images, and document layout. However, most existing retrieval-augmented generation (RAG) frameworks rely on static retrieval and fusion policies with fixed modality weights and uniform retrieval depth, making them less adaptable to diverse query intents and dynamic cross-modal dependencies. As a result, they often retrieve incomplete evidence and yield suboptimal reasoning in complex long-document scenarios. To address these challenges, we propose MARL-RAGDoc, a hierarchical multi-agent reinforcement learning framework for multimodal retrieval-augmented reasoning. A high-level coordinator agent dynamically allocates modality weights and retrieval depth based on query characteristics, while specialized text, image, and table agents perform fine-grained evidence selection within their respective candidate pools. A collaborative reasoning module integrates the retrieved evidence and provides hierarchical reward signals to continuously optimize retrieval policies. Experimental results on multiple multimodal document benchmarks demonstrate that MARL-RAGDoc consistently outperforms baselines in both retrieval accuracy and reasoning performance, while remaining computationally efficient. Our code and dataset are publicly available at https://github.com/Yihong-Q/MARL-RAGDoc.
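
The abstract mentions hierarchical reward signals but does not give their form. The sketch below shows one plausible shape, under stated assumptions: the coordinator is rewarded by answer correctness alone, and each modality agent receives a blend of that global signal and the precision of its own selected evidence. The function name `hierarchical_rewards`, the blending coefficient `alpha`, and the precision-based local term are all hypothetical and should not be read as the paper's actual reward design.

```python
def hierarchical_rewards(answer_correct, selected, relevant, alpha=0.5):
    """Hypothetical two-level reward.

    answer_correct: whether the final answer was judged correct.
    selected: modality -> list of evidence ids the agent chose.
    relevant: modality -> set of ids labeled as supporting evidence.
    alpha: assumed weight of the global signal in each agent's reward.
    """
    global_reward = 1.0 if answer_correct else 0.0
    rewards = {"coordinator": global_reward}  # high level: answer quality only

    for modality, ids in selected.items():
        gold = relevant.get(modality, set())
        # Local term: fraction of this agent's picks that were relevant.
        local = len(set(ids) & gold) / len(ids) if ids else 0.0
        rewards[modality] = alpha * global_reward + (1 - alpha) * local
    return rewards


rewards = hierarchical_rewards(
    answer_correct=True,
    selected={"text": ["t1", "t2"], "image": [], "table": ["b1"]},
    relevant={"text": {"t1"}, "table": {"b1"}},
)
```

Blending a shared global term with a per-agent local term is a common way to give each agent credit for its own choices while still tying it to the final outcome; whether MARL-RAGDoc does this is not stated in the abstract.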
