What question did this study set out to answer?

This study aims to evaluate the agreement among large language models, human reviewers, and authors in assessing STROBE compliance in observational rheumatology studies.

May 31, 2026 · Open Access

Agreement between large language models, human reviewers, and authors in evaluating STROBE checklists for observational studies in rheumatology

Key Points

Seventeen rheumatology articles were independently assessed using the 22-item STROBE checklist.
Evaluations were conducted by a 5-person human panel and 2 large language models (ChatGPT-5.2 and Gemini 3 Pro).
Interrater reliability was assessed using Gwet's Agreement Coefficient (AC1) with 95% CIs.
Overall agreement among reviewers was 85.0% (AC1 = 0.826 [95% CI: 0.801-0.851]).
Agreement was almost perfect for 'Presentation & Context' (AC1 = 0.841 [95% CI: 0.810-0.872]).
LLMs agreed completely with human reviewers on standard formatting elements but showed lower agreement on complex methodological items.

Abstract

Objectives: Evaluating compliance with the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement can be time-consuming and subjective. This study compares STROBE assessments from large language models (LLMs), a human reviewer panel, and the original manuscript authors in observational rheumatology research.

Methods: Guided by the Guidelines for Reporting Reliability and Agreement Studies and the Development, Evaluation, and Assessment of Large Language Model Pathway B frameworks, 17 rheumatology articles (11 cohort, 4 cross-sectional, and 2 case-control) were independently assessed. Evaluations used the 22-item STROBE checklist, completed by the authors, a 5-person human panel (ranging from junior to senior professionals), and 2 LLMs (ChatGPT-5.2 and Gemini 3 Pro). Interrater reliability was calculated using Gwet's Agreement Coefficient (AC1) with 95% CIs.

Results: Overall agreement across all reviewers was 85.0% (AC1 = 0.826; 95% CI: 0.801-0.851). Domain stratification showed almost perfect agreement for 'Presentation & Context' (AC1 = 0.841; 95% CI: 0.810-0.872) and substantial agreement for 'Methodological Rigor' (AC1 = 0.803; 95% CI: 0.761-0.845). Although LLMs achieved complete agreement with all human reviewers on standard formatting elements, their agreement declined on complex methodological items, with some pairwise comparisons yielding negative AC1 values. Intra-LLM and cross-version agreement across repeated independent runs was high, and estimates were stable across publication periods, providing no clear evidence of data leakage.

Conclusions: While LLMs show potential for basic STROBE screening, their lower agreement with human experts on complex methodological items likely reflects a reliance on surface-level information. These models appear more reliable for standardising straightforward checks than for replacing expert human judgement in evaluating observational research.
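
For readers unfamiliar with the statistic, the following is a minimal sketch of how Gwet's AC1 can be computed for several raters scoring the same checklist items. It is an illustration only and not the authors' analysis code: the function name `gwet_ac1`, the 'yes'/'no' coding, and the toy ratings are all assumptions made for this example, and it does not compute the confidence intervals reported above.

```python
from collections import Counter


def gwet_ac1(ratings, categories):
    """Return (observed agreement, Gwet's AC1) for multi-rater data.

    ratings: list of items; each item is a list with one label per rater.
    categories: all possible labels (hypothetical coding, e.g. 'yes'/'no').
    Assumes every item was scored by at least two raters.
    """
    n = len(ratings)
    q = len(categories)
    agree_sum = 0.0
    prevalence = {cat: 0.0 for cat in categories}

    for item in ratings:
        r = len(item)                      # raters who scored this item
        counts = Counter(item)
        # Share of rater pairs that agree on this item.
        agree_sum += sum(c * (c - 1) for c in counts.values()) / (r * (r - 1))
        # Accumulate the share of raters choosing each category.
        for cat in categories:
            prevalence[cat] += counts.get(cat, 0) / r

    pa = agree_sum / n                     # observed agreement
    pi = [v / n for v in prevalence.values()]
    # Chance agreement as defined for AC1.
    pe = sum(p * (1 - p) for p in pi) / (q - 1)
    return pa, (pa - pe) / (1 - pe)


# Toy data (invented): 4 checklist items, 3 raters each.
toy = [
    ["yes", "yes", "yes"],
    ["yes", "yes", "no"],
    ["no", "no", "no"],
    ["yes", "yes", "yes"],
]
pa, ac1 = gwet_ac1(toy, ["yes", "no"])
print(f"observed agreement = {pa:.3f}, AC1 = {ac1:.3f}")
```

In this toy case the raters agree on about 83% of rater pairs and AC1 comes out at 0.70. AC1 becomes negative when observed agreement falls below the chance-agreement term, which is how pairwise comparisons on difficult items can yield negative values.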
