Large language models (LLMs) have recently demonstrated impressive advances in complex reasoning, yet their performance in clinical natural language processing (NLP) remains limited. Clinical tasks require grounding in extensive domain-specific knowledge, precise evidence integration, and reliable multi-step reasoning, capabilities that current LLMs struggle to achieve. Retrieval-Augmented Generation (RAG) offers a promising solution by incorporating external medical knowledge without additional model training. However, existing clinical RAG systems face three major challenges: imprecise retrieval from long and complex medical documents, difficulty transforming retrieved evidence into coherent reasoning processes, and high sensitivity to retrieval noise. To address these limitations, we introduce ER-MedRAG (Extractor-Respondent Medical Retrieval-Augmented Generation), a multi-agent reinforcement learning framework designed to enhance clinical reasoning in RAG systems. ER-MedRAG employs an Extractor–Respondent architecture that first performs a coarse-to-fine hybrid retrieval process to identify highly relevant evidence snippets. The extractor agent then converts each snippet into a structured condition–relation–conclusion reasoning triplet. These triplets are subsequently concatenated into a unified representation and passed to the respondent agent to guide clinical decision-making. To strengthen each agent's specialized capabilities, we develop a two-stage reinforcement learning paradigm: the extractor is optimized using Direct Preference Optimization (DPO) to generate concise and informative reasoning triplets, while the respondent is trained with Group Relative Policy Optimization (GRPO) to leverage structured evidence effectively and remain robust to retrieval noise.
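As an illustration of the data flow from extractor to respondent, the following minimal sketch represents a condition–relation–conclusion triplet and concatenates several triplets into a single prompt. The class name `ReasoningTriplet`, the function `build_respondent_prompt`, and the textual rendering of a triplet are hypothetical choices made for this example; the abstract does not specify the actual serialization used by ER-MedRAG.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ReasoningTriplet:
    """One condition-relation-conclusion triplet (hypothetical schema)."""
    condition: str
    relation: str
    conclusion: str

    def render(self) -> str:
        # Assumed plain-text rendering; the real format is not given in the abstract.
        return f"IF {self.condition} | {self.relation} | THEN {self.conclusion}"


def build_respondent_prompt(question: str, triplets: List[ReasoningTriplet]) -> str:
    """Concatenate numbered triplets into one evidence block, followed by the question."""
    evidence = "\n".join(
        f"[{i}] {t.render()}" for i, t in enumerate(triplets, start=1)
    )
    return f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer:"
```

In this sketch, each retrieved snippet would yield one `ReasoningTriplet`, and the respondent would receive only the concatenated evidence block rather than the raw snippets.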
We evaluate ER-MedRAG on six medical question-answering benchmarks spanning multiple difficulty levels (MedQA, MedMCQA, PubMedQA, MMLU-ProM, GPQA-M, and MedXpertQA), using both 7B/8B and 70B open-source base models. Experimental results demonstrate that ER-MedRAG consistently outperforms strong RAG and reinforcement-learning-based baselines, achieving accuracy gains of 3% to 6% across all six datasets, with especially pronounced improvements on the reasoning-intensive benchmarks MMLU-ProM, GPQA-M, and MedXpertQA. Moreover, ER-MedRAG reduces output entropy, indicating more stable and reliable clinical reasoning.
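One way to make the notion of output entropy concrete is to sample several answers to the same question and compute the Shannon entropy of the resulting answer distribution. The sketch below does this; it is an assumption for illustration only, since the abstract does not state how entropy is measured (it may, for instance, be computed at the token level). The function name `answer_entropy` is hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def answer_entropy(answers: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of sampled answers."""
    if not answers:
        raise ValueError("answers must be non-empty")
    total = len(answers)
    counts = Counter(answers)
    # Sum p * log2(1/p) over the distinct answers; 0.0 means all samples agree.
    return sum((c / total) * math.log2(total / c) for c in counts.values())
```

Under this definition, a model that always selects the same option has an entropy of zero, while one that spreads its samples evenly over four options has an entropy of two bits.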
Shi et al. (Sat,) studied this question.