What question did this study set out to answer?

The study assesses whether collaboration among multiple LLMs can make abstract screening for systematic reviews more efficient and cost-effective.

February 9, 2026 (Open Access)

LLM-based Multi-Agent Collaboration for Abstract Screening towards Automated Systematic Reviews

Key Points

Aim: assess whether multi-LLM collaboration can make abstract screening for systematic reviews more efficient and cost-effective.
Abstract screening was framed as a question-answering (QA) task using LLMs.
Three multi-LLM collaboration strategies were evaluated: majority voting, multi-agent debate (MAD), and LLM-based adjudication.
The strategies were tested on 28 systematic reviews using performance metrics such as Mean Average Precision (MAP) and Work Saved over Sampling at 95% recall (WSS@95%).
Multi-LLM collaboration outperformed the QA baselines in abstract screening efficiency.
Majority voting was the best strategy overall, achieving the highest MAP scores and enabling, in theory, up to 68% workload reduction at 95% recall.
MAD improved weaker models most, while the adjudicator-as-a-ranker method was the second strongest approach but cost significantly more than majority voting and debate.
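
The two metrics named above can be illustrated with a minimal sketch. It assumes the commonly used definitions: average precision is the mean of precision at each rank that holds a relevant abstract, MAP is the mean of that value over reviews, and WSS@95% is the fraction of abstracts left unscreened when 95% recall is reached, minus 5%. The function names and the toy ranking are hypothetical, and the paper's own evaluation scripts may differ in details such as tie handling.

```python
from math import ceil


def average_precision(ranked_labels):
    """Average precision for one review.

    ranked_labels: 0/1 relevance labels in ranked order (best first).
    """
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """MAP: mean of average precision over several reviews."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


def wss_at_recall(ranked_labels, recall=0.95):
    """Work Saved over Sampling at the given recall level.

    Computed as (N - rank_at_recall) / N - (1 - recall), where
    rank_at_recall is the first rank at which the target recall is met.
    """
    total = len(ranked_labels)
    relevant = sum(ranked_labels)
    if total == 0 or relevant == 0:
        return 0.0
    # Small epsilon guards against floating-point overshoot before ceil().
    needed = ceil(recall * relevant - 1e-9)
    found = 0
    for rank, label in enumerate(ranked_labels, start=1):
        found += label
        if found >= needed:
            return (total - rank) / total - (1 - recall)
    return 0.0


# Toy ranking (hypothetical): 10 abstracts, 2 relevant at ranks 1 and 3.
toy = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
toy_ap = average_precision(toy)
toy_wss = wss_at_recall(toy)
```

Under this definition, a WSS@95% of 0.680 means a reviewer could in theory skip 68% of the abstracts, relative to screening in random order, while still finding 95% of the relevant studies.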

Abstract

Objective: Systematic reviews (SRs) are essential for evidence-based practice but remain labor-intensive, especially during abstract screening. This study evaluates whether collaboration among multiple large language models (multi-LLM collaboration) can improve the efficiency and reduce the cost of abstract screening.

Methods: Abstract screening was framed as a question-answering (QA) task using cost-effective LLMs. Three multi-LLM collaboration strategies were evaluated: majority voting by averaging the opinions of peers, multi-agent debate (MAD) for answer refinement, and LLM-based adjudication against the answers of individual QA baselines. These strategies were evaluated on 28 SRs of the CLEF eHealth 2019 Technology-Assisted Review benchmark using standard performance metrics such as Mean Average Precision (MAP) and Work Saved over Sampling at 95% recall (WSS@95%).

Results: Multi-LLM collaboration significantly outperformed QA baselines. Majority voting was overall the best strategy, achieving the highest MAP of 0.462 and 0.341 on subsets of SRs about clinical intervention and diagnostic technology assessment, respectively, with WSS@95% of 0.606 and 0.680, enabling in theory up to 68% workload reduction at 95% recall of all relevant studies. MAD improved weaker models most. Our own adjudicator-as-a-ranker method was the second strongest approach, surpassing adjudicator-as-a-judge, but at a significantly higher cost than majority voting and debating.

Conclusion: Multi-LLM collaboration substantially improves abstract screening efficiency, and its success lies in model diversity. Making the best use of diversity, majority voting stands out for both excellent performance and low cost compared to adjudication. Despite context-dependent gains and diminishing model diversity, MAD is still a cost-effective strategy and a potential direction for further research.
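
As an illustration of majority voting by averaging, the sketch below combines per-model relevance scores into one ranking. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify how each model's opinion is scored, so the code assumes every model returns a numeric relevance score between 0 and 1 for each abstract. The model names, abstract IDs, and function names are hypothetical.

```python
from statistics import mean


def average_votes(scores_by_model):
    """Average each abstract's relevance score across models.

    scores_by_model: {model_name: {abstract_id: score in [0, 1]}}.
    Only abstracts scored by every model are kept (an assumption).
    """
    per_model = list(scores_by_model.values())
    shared_ids = set.intersection(*(set(scores) for scores in per_model))
    return {
        abstract_id: mean(scores[abstract_id] for scores in per_model)
        for abstract_id in shared_ids
    }


def rank_abstracts(avg_scores):
    """Sort abstract IDs by descending average score, ties broken by ID."""
    return sorted(avg_scores, key=lambda a: (-avg_scores[a], a))


# Hypothetical scores from three models for three abstracts.
example_scores = {
    "model_a": {"a1": 0.9, "a2": 0.2, "a3": 0.6},
    "model_b": {"a1": 0.7, "a2": 0.4, "a3": 0.1},
    "model_c": {"a1": 0.8, "a2": 0.3, "a3": 0.5},
}
averaged = average_votes(example_scores)
ranking = rank_abstracts(averaged)
```

Averaging needs no extra LLM calls beyond the individual QA runs, which is consistent with the abstract's observation that majority voting costs less than adjudication.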

Authors

Opeoluwa Akinseloyin

Coventry University

Xiaorui Jiang

University of Sheffield

Vasile Palade

Naval Research Laboratory Information Technology Division

Journals

Biology Methods and Protocols

Institutions

University of Sheffield

Coventry University
