August 9, 2025

Performance of LLMs in Citation Screening: A Comparison Across Datasets with Varied Inclusion Rates.

Key Points

The study explores citation screening using large language models across datasets with different inclusion rates.
Performance varied among the six tested LLMs, with none consistently superior across all datasets.
Ensemble learning and majority voting techniques were applied to enhance citation screening performance.
Sensitivity and specificity results showed strong dependence on the varying inclusion rates within the datasets.
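
For readers less familiar with the two metrics named above, the following is a minimal sketch of how sensitivity and specificity are computed for a binary include/exclude screening task. The function name and the toy labels are illustrative assumptions, not taken from the study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for labels coded 1 = include, 0 = exclude.

    Hypothetical helper for illustration; not the study's evaluation code.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # relevant, kept
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # relevant, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # irrelevant, dropped
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # irrelevant, kept

    # Guard against datasets with no positives or no negatives.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy dataset with a 20% inclusion rate (2 relevant citations out of 10).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
print(sensitivity_specificity(y_true, y_pred))
```

With a low inclusion rate, sensitivity rests on very few relevant citations, so a single missed inclusion moves it sharply, while specificity is averaged over many more records.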

Abstract

Systematic reviews involve time-intensive screening of titles, abstracts, and full texts to identify relevant studies. This study evaluates the potential of large language models (LLMs) to automate citation screening across three datasets with varying inclusion rates. Six LLMs were tested using zero- to five-shot in-context learning, with demonstrations selected by semantic similarity computed with PubMedBERT. Majority voting and ensemble learning were applied to enhance performance. Results showed that no single LLM consistently excelled across the datasets, and that sensitivity and specificity were influenced by inclusion rates. Overall, ensemble learning and majority voting improved citation screening performance.
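
Two of the techniques named in the abstract can be sketched briefly: choosing few-shot demonstrations by semantic similarity, and combining several models' decisions by majority vote. This is a minimal sketch under stated assumptions, not the authors' implementation. The embeddings are toy vectors standing in for PubMedBERT outputs, the function names are hypothetical, and breaking ties in favor of inclusion is an assumption the abstract does not specify.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_demonstrations(query_vec, pool, k):
    """Return the ids of the k labeled citations most similar to the query.

    `pool` is a list of (citation_id, embedding) pairs. In practice the
    embeddings would come from an encoder such as PubMedBERT; here they
    are plain lists of floats (assumption for illustration).
    """
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [citation_id for citation_id, _ in ranked[:k]]


def majority_vote(votes, tie_label="include"):
    """Combine per-model 'include'/'exclude' decisions for one citation.

    Ties resolve to `tie_label`; defaulting to 'include' is an assumed
    choice that favors sensitivity over specificity.
    """
    counts = Counter(votes)
    if counts["include"] == counts["exclude"]:
        return tie_label
    return "include" if counts["include"] > counts["exclude"] else "exclude"


# Usage with toy data: pick 2 demonstrations, then vote across 3 models.
pool = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
print(select_demonstrations([1.0, 0.0], pool, k=2))
print(majority_vote(["include", "exclude", "include"]))
```

The selected demonstrations would be placed in the prompt ahead of the citation being screened, with k ranging from zero to five to match the zero- to five-shot setting described above.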

Authors

Zhihong Zhang

M. Nezhad

Pallavi Gupta

Institutions

Columbia University

Columbia University Irving Medical Center
