Abstract

Background
Peer review is a cornerstone of scientific quality control, yet it is increasingly burdened by growing manuscript volumes and reviewer fatigue. Large language models (LLMs) have emerged as potential tools to support scientific review, but it remains unclear whether AI-generated reviews are equivalent to human reviews on the endpoint that ultimately matters: agreement with the final editorial decision.

Methods
We retrieved 40 manuscripts previously submitted to a cardiology journal (20 ultimately accepted, 20 de novo rejected) along with all available historical human peer reviews (n = 77). For each manuscript, we generated a corresponding peer review using an LLM in deep research mode (n = 41). All 118 reviews were reformatted into a single anonymous template by two unblinded investigators and scored independently by two blinded editors across seven domains (digestion, focus, balance, suggestions, precision, politeness, conclusiveness; 0–2 scale). The primary endpoint was concordance between each reviewer's recommendation (in favour of vs. against publication) and the final editorial decision. Secondary endpoints were domain-specific quality scores and AI–human inter-rater agreement (Cohen's κ).

Results
Concordance with the final editorial decision was 67.5% for AI-generated reviews (27/40) and 71.9% for the human consensus (23/32 evaluable; p = 0.74). Stratified by editorial outcome, AI correctly recommended publication in 75% of accepted manuscripts and rejection in 60% of rejected manuscripts; the corresponding figures for the human consensus were 88% and 56%. AI-generated reviews scored significantly higher than human reviews in five of seven quality domains (focus, balance, suggestions, precision, conclusiveness; all p < 0.05), with a higher total sum score (13.2 ± 0.9 vs. 11.4 ± 2.0; p < 0.001). AI–human inter-rater agreement was substantial (κ = 0.73), exceeding human–human agreement on the same articles (κ = 0.54). AI reviews were generated in 2–6 minutes, versus a median 17-day turnaround for human reviews.

Conclusions
LLM-generated peer reviews are non-inferior to human reviews in terms of agreement with the final editorial decision, while showing higher internal consistency, comparable quality on structured domains, and substantially shorter turnaround. These findings support the integration of AI as a complementary tool in editorial workflows, rather than as a replacement for human peer review.
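As an illustration of the statistics reported above, the following minimal Python sketch recomputes the two concordance percentages from the stated counts (27/40 and 23/32) and shows how a two-group comparison and Cohen's κ could be calculated. The abstract does not name the test behind p = 0.74, so the two-sided Fisher exact test here is an assumption and need not reproduce that value; the function names and the example rating lists are hypothetical, not study data.

```python
from math import comb


def concordance(agree: int, total: int) -> float:
    """Percentage of recommendations that match the editorial decision."""
    return 100.0 * agree / total


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Assumption: the abstract does not state which test was used.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x concordant reviews in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables at least as extreme (as improbable) as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def cohen_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    n = len(rater_a)
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    categories = set(rater_a) | set(rater_b)
    p_e = sum((rater_a.count(k) / n) * (rater_b.count(k) / n) for k in categories)
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Counts taken from the abstract: AI 27/40, human consensus 23/32.
    print(f"AI concordance:    {concordance(27, 40):.1f}%")
    print(f"Human concordance: {concordance(23, 32):.1f}%")
    print(f"Fisher exact p:    {fisher_exact_two_sided(27, 13, 23, 9):.2f}")
    # Hypothetical recommendations (1 = publish, 0 = reject), not study data.
    print(f"Example kappa:     {cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0]):.2f}")
```

Only the standard library is used, so the sketch can be run without a statistics package. The κ values in the abstract (0.73 and 0.54) cannot be recomputed from the reported summary figures alone, because κ requires the paired per-manuscript recommendations.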
Zancanaro et al. studied this question.