ABSTRACT

Objectives: Evaluating compliance with the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement can be time-consuming and subjective. This study compares STROBE assessments from large language models (LLMs), a human reviewer panel, and the original manuscript authors in observational rheumatology research.

Methods: Guided by the Guidelines for Reporting Reliability and Agreement Studies and the Development, Evaluation, and Assessment of Large Language Model Pathway B frameworks, 17 rheumatology articles (11 cohort, 4 cross-sectional, and 2 case-control) were independently assessed. Evaluations used the 22-item STROBE checklist, completed by the authors, a 5-person human panel (ranging from junior to senior professionals), and 2 LLMs (ChatGPT-5.2 and Gemini 3 Pro). Interrater reliability was calculated using Gwet's Agreement Coefficient (AC1) with 95% CIs.

Results: Overall agreement across all reviewers was 85.0% (AC1 = 0.826; 95% CI: 0.801-0.851). Domain stratification showed almost perfect agreement for ‘Presentation & Context’ (AC1 = 0.841; 95% CI: 0.810-0.872) and substantial agreement for ‘Methodological Rigor’ (AC1 = 0.803; 95% CI: 0.761-0.845). Although LLMs achieved complete agreement with all human reviewers on standard formatting elements, their agreement declined on complex methodological items, with some pairwise comparisons yielding negative AC1 values. Intra-LLM and cross-version agreement across repeated independent runs was high, and estimates were stable across publication periods, providing no clear evidence of data leakage.

Conclusions: While LLMs show potential for basic STROBE screening, their lower agreement with human experts on complex methodological items likely reflects a reliance on surface-level information. These models appear more reliable for standardising straightforward checks than for replacing expert human judgement in evaluating observational research.
Bilgin et al. studied this question.
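To make the agreement statistic named in the Methods concrete, the following is a minimal sketch of Gwet's AC1 for the simplest case of two raters scoring the same items. It is illustrative only: the abstract does not describe how the multi-rater coefficients or their confidence intervals were computed, so the function name `gwet_ac1`, the binary coding (1 = item reported, 0 = not reported), and the toy ratings are all assumptions and not the study's data or code.

```python
from collections import Counter


def gwet_ac1(ratings_a, ratings_b, categories):
    """Two-rater Gwet's AC1 (hypothetical helper, not the study's code).

    ratings_a, ratings_b: equal-length sequences of category labels.
    categories: every label a rater could have used (at least two).
    """
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("both raters must score the same non-empty item set")
    q = len(categories)
    if q < 2:
        raise ValueError("AC1 needs at least two categories")

    n = len(ratings_a)

    # Observed agreement: share of items given the same label by both raters.
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # pi_k: average of the two raters' marginal proportions for category k.
    counts = Counter(ratings_a) + Counter(ratings_b)
    pi = [counts[k] / (2 * n) for k in categories]

    # Chance agreement under Gwet's model.
    p_e = sum(p * (1 - p) for p in pi) / (q - 1)

    return (p_a - p_e) / (1 - p_e)


# Toy ratings (assumed): 10 checklist items, coded 1 = reported, 0 = not.
human = [1, 1, 1, 1, 0, 1, 1, 1, 0, 1]
model = [1, 1, 1, 1, 1, 1, 1, 0, 0, 1]
print(round(gwet_ac1(human, model, categories=(0, 1)), 3))

# Systematic disagreement drives the coefficient below zero.
print(gwet_ac1([1, 0, 1, 0], [0, 1, 0, 1], categories=(0, 1)))
```

In the first toy comparison the raters agree on 8 of 10 items and AC1 is about 0.706. The second comparison shows how a negative value, like those reported for some pairwise comparisons, arises when observed agreement falls below the chance-agreement term.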