What type of study is this?

This is an observational study (also classified as a literature review).

August 16, 2025 · Open Access

Assessing risk of bias of cohort studies with large language models

Key Points

Overall accuracy of the risk-of-bias assessments ranged from 80.8% to 83.3% across the three LLMs.
Moonshot-v1-128k showed higher sensitivity in population selection (0.92 versus 0.55 for ChatGPT-4o).
ChatGPT-4o showed the highest consistency, with a mean kappa of 96.5% and perfect agreement in outcome confidence.
ChatGPT-4o was also far faster than manual assessment: 32.8 seconds per article versus 20 minutes.
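
As a quick check on the timing figures above, the relative time saving can be worked out directly. This is a minimal sketch; the helper name `time_reduction` is hypothetical, and it assumes "20 minutes" means exactly 1,200 seconds per article.

```python
def time_reduction(manual_seconds: float, llm_seconds: float) -> float:
    """Fraction of manual assessment time saved by the LLM."""
    return (manual_seconds - llm_seconds) / manual_seconds


# Figures from the key points: 20 minutes manually vs. 32.8 seconds with ChatGPT-4o.
reduction = time_reduction(manual_seconds=20 * 60, llm_seconds=32.8)
print(f"{reduction:.1%}")  # prints 97.3%
```

The result matches the 97.3% per-article speed-up reported in the abstract.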

Abstract

This study aims to explore the feasibility and accuracy of using large language models (LLMs) to assess the risk of bias (ROB) in cohort studies. We conducted a pilot and feasibility study in 30 cohort studies randomly selected from the reference lists of published Cochrane reviews. We developed a structured prompt to guide ChatGPT-4o, Moonshot-v1-128k, and DeepSeek-V3 in assessing the ROB of each cohort study twice. We used the ROB results assessed by three evidence-based medicine experts as the gold standard, and then evaluated the accuracy of the LLMs by calculating the correct assessment rate, sensitivity, specificity, and F1 scores at the overall and item-specific levels. The consistency of the overall and item-specific assessment results was evaluated using Cohen's kappa (κ) and prevalence-adjusted bias-adjusted kappa. Efficiency was estimated by the mean assessment time required. The three LLMs (ChatGPT-4o, Moonshot-v1-128k, and DeepSeek-V3) showed distinct performance across the eight assessment items. Overall accuracy was comparable (80.8%–83.3%). Moonshot-v1-128k showed superior sensitivity in population selection (0.92 versus ChatGPT-4o's 0.55, P < 0.001). In terms of F1 scores, Moonshot-v1-128k led in population selection (F1 = 0.80 versus ChatGPT-4o's 0.67, P = 0.004). ChatGPT-4o demonstrated the highest consistency (mean κ = 96.5%), with perfect agreement (100%) in outcome confidence. ChatGPT-4o was 97.3% faster per article than manual assessment (32.8 seconds versus 20 minutes) and outperformed Moonshot-v1-128k and DeepSeek-V3 by 47–50% in processing speed. The efficient and accurate assessment of ROB in cohort studies by ChatGPT-4o, Moonshot-v1-128k, and DeepSeek-V3 highlights the potential of LLMs to enhance the systematic review process.
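
The evaluation metrics named in the abstract can be illustrated with a short sketch. It is not the authors' code: the binary "high"/"low" labels, the function names, and the toy data are all assumptions made for illustration, and the formulas are the standard textbook definitions (for two categories, prevalence-adjusted bias-adjusted kappa reduces to 2 × observed agreement − 1).

```python
from collections import Counter


def binary_metrics(gold, pred, positive="high"):
    """Accuracy, sensitivity, specificity, and F1 against a gold standard."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    tn = sum(g != positive and p != positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


def cohen_kappa(run1, run2):
    """Cohen's kappa: agreement between two runs, corrected for chance."""
    n = len(run1)
    p_observed = sum(a == b for a, b in zip(run1, run2)) / n
    c1, c2 = Counter(run1), Counter(run2)
    p_expected = sum((c1[label] / n) * (c2[label] / n) for label in c1.keys() | c2.keys())
    if p_expected == 1.0:  # both runs used a single identical label
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


def pabak(run1, run2):
    """Prevalence-adjusted bias-adjusted kappa for two categories."""
    p_observed = sum(a == b for a, b in zip(run1, run2)) / len(run1)
    return 2 * p_observed - 1


# Toy data (hypothetical): expert gold standard and two LLM runs on 8 studies.
gold = ["high", "high", "high", "low", "low", "low", "low", "high"]
run1 = ["high", "high", "low", "low", "low", "high", "low", "high"]
run2 = ["high", "high", "low", "low", "low", "low", "low", "high"]

print(binary_metrics(gold, run1))
print(cohen_kappa(run1, run2), pabak(run1, run2))
```

In this sketch accuracy is measured against the expert ratings, while kappa and its adjusted variant compare the two runs of the same model, mirroring the abstract's description of assessing each study twice.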

Authors

Danni Xia

Honghao Lai

Weilong Zhao

Journals

Research Synthesis Methods

Institutions

McMaster University

Peking University

Lanzhou University
