Citation screening in systematic review is timeconsuming. Machine learning can help semi-automate it but faces obstacles. Each systematic review is a new dataset without initial annotations. Extreme class imbalance against irrelevant studies makes it difficult to select a good subset of samples to train a classifier. The rigid requirement of a (near) total recall of relevant studies demands a careful trade-off between accuracy and recall. This paper pilots a weak classifier ensemble approach to tackle both challenges. The idea of ensembling is employed in two ways. First, multiple cost-effective large language models are applied and averaged to score and rank candidate studies to create a balanced pseudo-labelled training set. Second, different sets of pseudo-negative samples are bootstrapped from low-rank documents and multiple classifiers are trained and combined to make screening decisions. Experiments on 28 systematic reviews demonstrate significant performance improvements brought by the weakly supervised classifier ensemble, which also meets the rigid recall requirement for it to be safely used in practice.
Jiang et al. (Mon,) studied this question.