What question did this study set out to answer?

This research aims to improve the performance of large language models in medical question answering by using a multi-agent reasoning method.

June 17, 2026

Let large language models judge each other: multi-agent peer-reviewed reasoning for medical question answering

Key points

Developed a multi-agent peer-reviewed reasoning framework where LLMs evaluate each other's outputs.
Conducted experiments with 5 LLMs across 3 benchmark datasets.
Compared performance against single-model reasoning and voting ensembles.
The best peer-review model combination achieved an average accuracy of 0.820 across datasets, outperforming the strongest single model (0.777).
CoT-based majority voting ensembles reached at most 0.789.
The approach scaled effectively as more LLMs participated, and peer assessments reliably distinguished high- from low-quality reasoning chains.

Abstract

OBJECTIVE: To enhance the accuracy, interpretability, and robustness of large language models (LLMs) in medical question answering (MedQA).

MATERIALS AND METHODS: We designed a multi-agent peer-reviewed reasoning method in which multiple LLM agents independently generate chain-of-thought (CoT) reasoning with candidate answers, then act as peer reviewers to evaluate each other's reasoning for factual correctness and logical soundness. The highest-rated reasoning chain is selected to produce the final answer. Experiments were conducted with 5 state-of-the-art LLMs (Llama-3.1-8B, Qwen2.5-7B, Phi-4, DeepSeek-LLM-7B, and GPT-oss-20B) on 3 benchmark datasets: HeadQA, MedQA-USMLE, and PubMedQA. Performance was compared against single-model CoT reasoning and CoT-based majority voting.

RESULTS: Peer-reviewed reasoning consistently outperformed both baselines. The best model combination achieved an average accuracy of 0.820 across datasets, exceeding the strongest single model (0.777) and majority voting ensembles (up to 0.789). The method also scaled effectively with more participating models, while peer assessments reliably distinguished high- from low-quality reasoning chains.

CONCLUSION: The proposed multi-agent peer-reviewed reasoning method enables LLMs to act as both solvers and evaluators, yielding superior performance in MedQA. By emphasizing reasoning quality rather than answer agreement alone, this approach improves accuracy, interpretability, and robustness, offering a promising direction for trustworthy biomedical AI systems.
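
The difference between selecting the highest-rated reasoning chain and taking a majority vote can be illustrated with a minimal Python sketch. It is not the study's implementation: the agent names, the 1 to 5 score scale, the use of a mean to aggregate ratings, and the exclusion of self-ratings are all assumptions made for illustration, and the toy answers and scores are invented rather than taken from the experiments.

```python
from collections import Counter
from statistics import mean


def majority_vote(answers):
    """Baseline: return the most frequent candidate answer.

    On a tie, Counter keeps first-seen order, so the earliest answer wins.
    """
    return Counter(answers.values()).most_common(1)[0][0]


def peer_review_select(answers, scores):
    """Return the answer attached to the highest-rated reasoning chain.

    answers: {agent: candidate answer}
    scores:  {reviewer: {author: rating of that author's reasoning chain}}
    Assumption: an agent does not rate its own chain, and ratings are
    averaged. Requires at least two agents so every chain has a reviewer.
    """
    avg = {}
    for author in answers:
        peer_ratings = [
            ratings[author]
            for reviewer, ratings in scores.items()
            if reviewer != author and author in ratings
        ]
        avg[author] = mean(peer_ratings)
    best_author = max(avg, key=avg.get)
    return answers[best_author], best_author, avg


# Hypothetical toy case: two agents agree on "B", one answers "C".
answers = {"agent_a": "B", "agent_b": "B", "agent_c": "C"}

# Hypothetical peer ratings (1 = poor reasoning, 5 = sound reasoning).
scores = {
    "agent_a": {"agent_b": 2, "agent_c": 5},
    "agent_b": {"agent_a": 3, "agent_c": 4},
    "agent_c": {"agent_a": 2, "agent_b": 2},
}

print(majority_vote(answers))               # B
print(peer_review_select(answers, scores))  # ('C', 'agent_c', {...})
```

In this toy case majority voting returns "B" because two agents agree, while peer-review selection returns "C" because agent_c's chain receives the highest mean rating (4.5, against 2.5 and 2). This is the sense in which the method weighs reasoning quality rather than answer agreement alone.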
