What question did this study set out to answer?

To evaluate the effectiveness of AI-generated reviews in peer review and explore their alignment with human assessment.

February 16, 2026

Co-Reviewer: can AI review like a human? An agentic framework for LLM-human alignment in peer review

Key Points

Set out to evaluate how effective AI-generated peer reviews are and how closely they align with human assessment.
Introduced an AI framework called Co-Reviewer composed of four specialized agents.
Conducted a multi-dimensional evaluation comparing LLM-generated reviews with human-written reviews.
Used evaluation metrics including content informativeness, sentiment polarity and variability, score consistency, and alignment with final editorial decisions.
AI-generated reviews are well written but show excessive confidence and a bias towards acceptance.
LLMs struggle with nuanced critique and context-sensitive reasoning found in human evaluations.
Proposed enhancements include domain-adaptive fine-tuning and hybrid human-LLM review pipelines.

Abstract

The peer review process serves as a critical gatekeeper in scholarly communication: it provides constructive feedback, determines the credibility of research, and validates the scientific claims and overall quality of research papers. However, human reviews are often subjective and inconsistent. Because reviewing is voluntary, reviewers may not always devote enough time to evaluating manuscripts thoroughly, so the process remains vulnerable to bias and lackluster evaluations. Recent advances in Large Language Models (LLMs) offer a promising way to automate or augment peer review, and to complement or benchmark human reviewers. However, the extent to which LLMs can replicate human evaluation remains unexplored, particularly in terms of critical depth, reasoning accuracy, and alignment with human decision-making. To investigate this question, we introduce Co-Reviewer, an agentic AI framework composed of four specialized agents that work together to generate, evaluate, critique, and refine peer reviews. We also conduct a multi-dimensional evaluation comparing LLM-generated reviews with human-written reviews, using metrics such as content informativeness, sentiment polarity and variability, score consistency, and alignment with final editorial decisions. Our results show that while LLMs can produce clear, well-written reviews, they exhibit consistent problems: they sound overconfident, favor acceptance, and struggle to adjust to changes in manuscripts. LLMs also often confuse linguistic fluency with substantive critique, missing the nuanced, context-sensitive reasoning found in expert human assessments.
To address these limitations, we propose several enhancements: domain-adaptive fine-tuning on peer review datasets, structured aspect-based critique generation, sentiment modulation for more calibrated feedback, and hybrid pipelines that combine LLM outputs with human oversight. Our work contributes to the growing body of research on AI-assisted scholarly evaluation and underscores both the potential and the limitations of using LLMs as co-reviewers in academic publishing workflows. The dataset and code that replicate our findings are publicly available at https://github.com/PrabhatkrBharti/Co-Reviewer.git.
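
As a rough illustration of two of the measures named in the abstract, alignment with final editorial decisions and the tendency to favor acceptance, the following minimal Python sketch shows one way such quantities could be computed. It is not the authors' implementation: the function names, the "accept"/"reject" labels, and the toy decision lists are all assumptions made up for this example, and the paper's actual metrics may be defined differently.

```python
from typing import Sequence


def acceptance_rate(decisions: Sequence[str]) -> float:
    """Fraction of decisions that are 'accept' (hypothetical label scheme)."""
    if not decisions:
        raise ValueError("decisions must be non-empty")
    return sum(d == "accept" for d in decisions) / len(decisions)


def decision_agreement(llm: Sequence[str], final: Sequence[str]) -> float:
    """Fraction of papers where the LLM recommendation matches the final decision."""
    if len(llm) != len(final):
        raise ValueError("both sequences must cover the same papers")
    if not llm:
        raise ValueError("sequences must be non-empty")
    return sum(a == b for a, b in zip(llm, final)) / len(llm)


def acceptance_bias(llm: Sequence[str], final: Sequence[str]) -> float:
    """LLM acceptance rate minus editorial acceptance rate.

    A positive value means the LLM recommends acceptance more often
    than the editors actually accepted.
    """
    return acceptance_rate(llm) - acceptance_rate(final)


# Toy, made-up inputs for illustration only (not data from the study).
llm_recs = ["accept", "accept", "accept", "reject", "accept"]
final_decisions = ["accept", "reject", "reject", "reject", "accept"]

print(f"agreement: {decision_agreement(llm_recs, final_decisions):.2f}")
print(f"acceptance bias: {acceptance_bias(llm_recs, final_decisions):+.2f}")
```

On these toy inputs the LLM agrees with the final decision on three of five papers and recommends acceptance more often than the editors did, which is the pattern the abstract describes as favoring acceptance.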
