What question did this study set out to answer?

To evaluate the effectiveness of different LLM coordination strategies in title-abstract screening tasks.

April 17, 2026 | Open Access

Comparing Single-Agent and Multi-Agent Strategies in LLM-Based Title-Abstract Screening

Key Points

Evaluated the effectiveness of different LLM coordination strategies in title-abstract screening.
Compared five coordination strategies: single-agent baseline, majority voting, recall-focused ensemble, confidence-weighted aggregation, and two-stage filtering.
Used four 4-bit quantised open-source LLMs: Mistral 7B, LLaMA 3.1 8B, Granite 3.3 8B, and Qwen 2.5 7B.
Employed zero-shot and few-shot configurations for model evaluation.
Analyzed a Gold Standard of 200 papers on blockchain-based e-voting from a corpus of 2036 records.
Achieved 100% recall, 70.4% precision, and 82.6% F1 score with the single-agent strategy using Qwen 2.5 7B in few-shot mode.
Reduced manual screening effort by 43.4% with that same configuration, which outperformed all multi-agent alternatives.
Confidence-weighted aggregation produced results identical to majority voting, indicating that self-reported confidence from 7–8B parameter models added no discriminative value.
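
The reported F1 score can be checked against the reported precision and recall. The sketch below assumes F1 is the standard harmonic mean of the two; the function name `f1_score` is illustrative and not taken from the study.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return 2 * precision * recall / (precision + recall)


# Values reported for the best configuration (Qwen 2.5 7B, few-shot).
precision = 0.704
recall = 1.0

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.1%}")  # F1 = 82.6%
```

With recall at 100%, F1 is determined entirely by precision, which is why the score sits between the two reported values.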

Abstract

Title-abstract screening remains labour-intensive, especially in interdisciplinary domains where shared terminology increases misclassification risk. This study compared five LLM coordination strategies—single-agent baseline, majority voting, recall-focused ensemble, confidence-weighted aggregation, and two-stage filtering—using four 4-bit quantised open-source models (Mistral 7B, LLaMA 3.1 8B, Granite 3.3 8B, Qwen 2.5 7B) in zero-shot and few-shot configurations. The evaluation was conducted on a Gold Standard of 200 papers from a corpus of 2036 records on blockchain-based e-voting. The best-performing configuration—a single-agent strategy with Qwen 2.5 7B in few-shot mode—achieved recall of 100%, precision of 70.4%, F1 of 82.6%, and a 43.4% reduction in manual screening effort, outperforming all multi-agent alternatives. Confidence-weighted aggregation produced results identical to majority voting, indicating that self-reported confidence from 7–8B parameter models did not add discriminative value. All screening decisions were logged on a private blockchain with timestamped anchoring for reproducibility. These results suggest that, for domain-specific screening tasks, careful model selection outweighs multi-agent coordination overhead, and that few-shot prompting with a well-matched model can achieve human-level recall with substantially reduced manual effort.
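
Three of the aggregation strategies named above can be sketched as simple vote-combining rules. This is a minimal illustration under stated assumptions, not the study's implementation: the function names, the strict-majority tie rule, the "include if any model includes" reading of the recall-focused ensemble, and the use of raw confidences as weights are all hypothetical.

```python
from typing import Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """Include a record when strictly more than half of the models vote include.

    Assumption: ties count as exclude.
    """
    return sum(votes) > len(votes) / 2


def recall_focused(votes: Sequence[bool]) -> bool:
    """Include a record when at least one model votes include (assumed rule)."""
    return any(votes)


def confidence_weighted(votes: Sequence[bool], confidences: Sequence[float]) -> bool:
    """Include when the summed confidence of include votes exceeds that of exclude votes.

    Assumption: each model's self-reported confidence is used directly as its weight.
    """
    if len(votes) != len(confidences):
        raise ValueError("votes and confidences must have the same length")
    include = sum(c for v, c in zip(votes, confidences) if v)
    exclude = sum(c for v, c in zip(votes, confidences) if not v)
    return include > exclude


# Hypothetical votes from four models on two records.
clear_case = [True, True, True, False]
lone_include = [True, False, False, False]
uniform_conf = [0.9, 0.9, 0.9, 0.9]  # made-up, near-uniform confidences

print(majority_vote(clear_case), confidence_weighted(clear_case, uniform_conf))
print(majority_vote(lone_include), recall_focused(lone_include))
```

When the confidences are nearly uniform, the weighted rule reduces to a head count, which is one situation in which confidence-weighted aggregation and majority voting would give the same decisions. The abstract does not say whether this is what happened in the study.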
