What does this research mean for the field?

Among the evaluated models, ChatGPT-5 achieved the highest agreement with human risk-of-bias assessments of randomized controlled trials. Novelty: confirmatory. Consensus alignment: supports consensus.

What question did this study set out to answer?

The aim is to evaluate the reliability of large language models in assessing risk-of-bias in randomized clinical trials.

February 21, 2026

Assessing the Reliability of Large Language Models for Evaluation of Risk-of-Bias in Randomized Clinical Trials

Key Points

The aim is to evaluate the reliability of large language models in assessing risk-of-bias in randomized clinical trials.
Retrospectively analyzed 180 randomized controlled trials from systematic reviews.
Used a standardized prompt incorporating the Cochrane RoB 2 algorithm for model evaluation.
Assessed model performance using Cohen’s kappa and prevalence- and bias-adjusted kappa.
Evaluated intra-model reliability across three independent runs.
ChatGPT-5 achieved the highest agreement in assessing randomization (76%) and missing outcome data (80%).
ChatGPT-5 showed moderate concordance (69%) for deviations from intended interventions.
All models struggled with selective reporting, showing only 47–51% agreement.
ChatGPT-5 showed higher concordance on overall risk-of-bias judgments (60–62%) than ChatGPT-4o (45%) and Claude 3.5 Sonnet (43%).
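
The agreement statistics named in the key points (raw agreement, Cohen's kappa, and PABAK) can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names, the toy ratings, and the choice of three rating categories (low, some concerns, high) for the PABAK formula are assumptions made for the example.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of items on which the two raters give the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    labels = set(counts_a) | set(counts_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in labels) / (n * n)
    if p_expected == 1.0:
        return float("nan")  # kappa is undefined when both raters use one label
    return (p_observed - p_expected) / (1 - p_expected)


def pabak(rater_a, rater_b, n_categories=3):
    """Prevalence- and bias-adjusted kappa for k categories: (k*Po - 1)/(k - 1)."""
    p_observed = percent_agreement(rater_a, rater_b)
    return (n_categories * p_observed - 1) / (n_categories - 1)


# Hypothetical ratings for six trials (illustrative only, not study data).
human = ["low", "low", "some", "high", "low", "some"]
model = ["low", "some", "some", "high", "low", "low"]

print(percent_agreement(human, model))
print(cohens_kappa(human, model))
print(pabak(human, model))
```

PABAK depends only on the observed agreement and the number of categories, so it is unaffected by how unevenly the labels are distributed, which is one reason it is often reported alongside Cohen's kappa.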

Abstract

Objective: Systematic reviews depend on rigorous risk-of-bias (RoB) assessments to ensure credibility, yet manual evaluation using the Cochrane RoB 2 tool is resource-intensive. While Large Language Models (LLMs) offer potential for automation, their alignment with human judgment remains underexplored. This study evaluates the reliability of ChatGPT-4o, ChatGPT-5, and Claude 3.5 Sonnet in assessing RoB in randomized controlled trials (RCTs), comparing their agreement with human reviewers and internal consistency.

Study Design: We retrospectively analyzed 180 RCTs from systematic reviews published in the American Journal of Obstetrics and Gynecology (2021–2023) reporting complete human RoB 2 ratings. Each LLM processed full-text PDFs using a standardized prompt incorporating the complete RoB 2 algorithm. Model performance was evaluated against human benchmarks using Cohen’s kappa and prevalence- and bias-adjusted kappa (PABAK). Intra-model reliability was assessed across three independent runs to measure consistency.

Results: ChatGPT-5 consistently outperformed other models, achieving the highest agreement in randomization (Domain 1; 76%), missing outcome data (Domain 3; 80%), and outcome measurement (Domain 4; 76%). It showed moderate concordance for deviations from intended interventions (69%). However, all models struggled with selective reporting (Domain 5), where agreement dropped to 47–51%. For overall risk-of-bias judgments, ChatGPT-5 demonstrated superior concordance (60–62%, κ=0.36–0.40) compared to ChatGPT-4o (45%) and Claude 3.5 Sonnet (43%). ChatGPT-5 also exhibited substantial to near-perfect internal consistency.

Conclusion: Among the evaluated models, ChatGPT-5 most closely approximated human RoB 2 assessments and achieved superior internal consistency, suggesting it could serve as a practical first-pass tool to reduce reviewer burden. However, persistent limitations in detecting selective reporting, likely due to the inability to cross-reference external trial registries, highlight that expert human oversight remains essential for accurate evidence synthesis.
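
The abstract reports intra-model reliability across three independent runs but does not state which statistic was used. As one possible way to summarize run-to-run consistency, the sketch below computes the mean pairwise agreement between runs and the share of trials rated identically in every run. Both measures, the function names, and the toy ratings are assumptions for illustration, not the authors' method.

```python
from itertools import combinations


def mean_pairwise_agreement(runs):
    """Average proportion of identical ratings over every pair of runs."""
    pair_scores = []
    for run_a, run_b in combinations(runs, 2):
        matches = sum(a == b for a, b in zip(run_a, run_b))
        pair_scores.append(matches / len(run_a))
    return sum(pair_scores) / len(pair_scores)


def unanimous_rate(runs):
    """Proportion of trials that received the same rating in all runs."""
    per_trial = list(zip(*runs))  # one tuple of ratings per trial
    return sum(len(set(ratings)) == 1 for ratings in per_trial) / len(per_trial)


# Hypothetical ratings from three runs over four trials (not study data).
runs = [
    ["low", "some", "high", "low"],
    ["low", "some", "low", "low"],
    ["low", "high", "high", "low"],
]

print(mean_pairwise_agreement(runs))
print(unanimous_rate(runs))
```

A chance-corrected statistic computed over the same run-by-trial layout would be a stricter alternative to these raw proportions.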

Authors

Takeshi Nagao

Tetsuya Kawakita

Journals

American Journal of Perinatology

Institutions

Old Dominion University

Jikei University School of Medicine
