What question did this study set out to answer?

This study aims to compare the effectiveness of AI-generated peer reviews with human reviews in cardiology publications.

May 29, 2026 | Open Access

AI as a Peer Reviewer: A Blinded Comparative Study of LLM-Generated and Human Reviews in a Cardiology Paper

Key Points

Analyzed 40 manuscripts submitted to a cardiology journal (20 accepted, 20 rejected) along with 77 human reviews.
Generated 41 peer reviews using a large language model in deep research mode for comparison.
Two blinded editors scored all 118 reviews (AI and human) across seven quality domains.
Concordance with the final editorial decision was 67.5% for AI reviews and 71.9% for the human consensus (p = 0.74).
AI correctly recommended publication for 75% of accepted manuscripts (human consensus: 88%) and rejection for 60% of rejected manuscripts (human consensus: 56%).
AI reviews scored higher in five quality domains and took 2–6 minutes to generate compared to a median 17 days for human reviews.

Abstract

Background: Peer review is a cornerstone of scientific quality control, yet it is increasingly burdened by growing manuscript volumes and reviewer fatigue. Large language models (LLMs) have emerged as potential tools to support scientific review, but it remains unclear whether AI-generated reviews are equivalent to human reviews on the endpoint that ultimately matters: agreement with the final editorial decision.

Methods: We retrieved 40 manuscripts previously submitted to a cardiology journal (20 ultimately accepted, 20 de novo rejected) along with all available historical human peer reviews (n = 77). For each manuscript, we generated a corresponding peer review using an LLM in deep research mode (n = 41). All 118 reviews were reformatted into a single anonymous template by two unblinded investigators and scored independently by two blinded editors across seven domains (digestion, focus, balance, suggestions, precision, politeness, conclusiveness; 0–2 scale). The primary endpoint was concordance between each reviewer recommendation (in favour of vs. against publication) and the final editorial decision. Secondary endpoints were domain-specific quality scores and AI–human inter-rater agreement (Cohen's κ).

Results: Concordance with the final editorial decision was 67.5% for AI-generated reviews (27/40) and 71.9% for the human consensus (23/32 evaluable; p = 0.74). Stratified by editorial outcome, AI correctly recommended publication in 75% of accepted manuscripts and rejection in 60% of rejected manuscripts; the corresponding figures for the human consensus were 88% and 56%. AI-generated reviews scored significantly higher than human reviews in five of seven quality domains (focus, balance, suggestions, precision, conclusiveness; all p < 0.05), with a higher total sum score (13.2 ± 0.9 vs. 11.4 ± 2.0; p < 0.001). AI–human inter-rater agreement was substantial (κ = 0.73), exceeding human–human agreement on the same articles (κ = 0.54). AI reviews were generated in 2–6 minutes versus a median 17-day turnaround for human reviews.

Conclusions: LLM-generated peer reviews are non-inferior to human reviews in terms of agreement with the final editorial decision, while showing higher internal consistency, comparable quality on structured domains, and substantially shorter turnaround. These findings support the integration of AI as a complementary tool in editorial workflows, rather than as a replacement for human peer review.
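
The primary endpoint and the agreement statistic reported above rest on simple arithmetic, which the following minimal Python sketch illustrates. It is not the authors' analysis code: the function names `concordance` and `cohen_kappa` are hypothetical, the 2x2 table passed to `cohen_kappa` is invented purely to show the formula, and the abstract does not say which statistical test produced p = 0.74, so no p-value is computed here. Only the counts 27/40 and 23/32 and the stratified percentages come from the abstract.

```python
def concordance(agree: int, total: int) -> float:
    """Percentage of recommendations that match the final editorial decision."""
    return 100.0 * agree / total


def cohen_kappa(table: list[list[int]]) -> float:
    """Cohen's kappa from a square agreement table (rows: rater A, columns: rater B)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    # Observed agreement: share of items on the diagonal.
    p_observed = sum(table[i][i] for i in range(k)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over categories.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    p_expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n**2
    return (p_observed - p_expected) / (1 - p_expected)


# Counts reported in the abstract.
ai_pct = concordance(27, 40)
human_pct = concordance(23, 32)
print(f"AI: {ai_pct:.1f}%  human consensus: {human_pct:.1f}%")

# Consistency check: the stratified AI figures (75% of 20 accepted,
# 60% of 20 rejected) add up to the overall 27 concordant reviews.
ai_agree = round(0.75 * 20) + round(0.60 * 20)
print(ai_agree)

# HYPOTHETICAL table, not study data: two raters, publish vs. reject.
example_table = [[18, 2],
                 [2, 18]]
print(round(cohen_kappa(example_table), 2))
```

Kappa corrects raw agreement for the agreement two raters would reach by chance, which is why it is reported alongside concordance rather than in place of it.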
