What does this research mean for the field?

Current large language models are better suited as supplementary tools than as standalone systems for evaluating Arabic academic research: they correlate poorly with human assessments overall, despite showing some capability in assessing methodological rigor. Novelty: incremental. Consensus alignment: neutral.

What question did this study set out to answer?

This study explores how effective large language models are in evaluating the quality of Arabic academic papers.

June 1, 2026

Towards automated evaluation of Arabic research: a study on the efficacy of large language models in analyzing the quality of Arabic academic papers

Key Points

This study explores how effective large language models are in evaluating the quality of Arabic academic papers.
Analyzed 60 Arabic academic papers selected from the social sciences and humanities.
Used criteria from the Research Excellence Framework to assess models' performance across originality, significance, and rigor.
Examined model performance with different input formats: full text, abstract, and title.
Claude achieved higher agreement rates for rigor (91.7%) but had a mean score of 3.27 for full text, showing an upward scoring bias.
ChatGPT had a lower mean score (2.54) with greater stability across iterations (SD 0.15–0.22).
Both models assessed methodological rigor better (human-model correlation 0.27) than originality (correlation -0.03).
Performance greatly declined when evaluating titles only, with ChatGPT's mean absolute deviation increasing from 0.49 to 1.09.
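
The figures above (agreement rate, mean score bias, correlation, mean absolute deviation) all compare model scores with human scores. The sketch below shows one plausible way to compute such metrics. It is not the authors' code: the function names, the exact-match definition of agreement, and the sample scores are all assumptions made for illustration, and the study may define these quantities differently.

```python
from math import sqrt
from statistics import mean

# Hypothetical scores for six papers (illustrative only; not data from the study).
human_scores = [1, 2, 3, 4, 2, 3]
model_scores = [2, 2, 3, 4, 3, 4]


def mean_absolute_deviation(model, human):
    """Average absolute gap between model and human scores (assumed MAD definition)."""
    return mean(abs(m - h) for m, h in zip(model, human))


def mean_bias(model, human):
    """Average signed gap; a positive value means the model scores higher than humans."""
    return mean(m - h for m, h in zip(model, human))


def agreement_rate(model, human, tolerance=0):
    """Share of papers where model and human scores differ by at most `tolerance`."""
    matches = sum(1 for m, h in zip(model, human) if abs(m - h) <= tolerance)
    return matches / len(human)


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    print("MAD:", mean_absolute_deviation(model_scores, human_scores))
    print("Bias:", mean_bias(model_scores, human_scores))
    print("Agreement:", agreement_rate(model_scores, human_scores))
    print("Correlation:", round(pearson_correlation(model_scores, human_scores), 3))
```

Reporting several metrics side by side matters because they can disagree: a model may agree with human scores on most papers for one component while still correlating poorly overall, which is the pattern the key points describe for Claude.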

Abstract

This study investigates the capabilities of Large Language Models (LLMs) in evaluating Arabic academic research, focusing on the performance of GPT-4 and Claude across multiple assessment dimensions. Through a pilot analysis of 60 strategically selected Arabic academic papers from social sciences and humanities using institutional quality classifications, we examined the models’ ability to assess research quality using criteria adapted from the Research Excellence Framework. The study employed various input formats (full text, abstract, and title) and analyzed model performance across three key components: originality, significance, and rigor. Results demonstrate distinct patterns between the models, with Claude achieving higher agreement rates in specific components (91.7% for rigor) despite poor overall correlations with human assessments but exhibiting an upward scoring bias (mean score 3.27 for full text), while ChatGPT displayed more conservative scoring patterns (mean score 2.54) with greater stability across iterations (SD 0.15–0.22). Both models showed stronger performance in evaluating methodological rigor (human-model correlation 0.27) compared to originality assessment (correlation -0.03). Performance degraded significantly with reduced input length, particularly for title-only evaluations (MAD increasing from 0.49 to 1.09 for ChatGPT). The findings suggest that while current LLMs show promise in supporting Arabic academic evaluation, particularly in structured assessment components, they are better suited as supplementary tools rather than standalone evaluation systems. This study contributes to understanding the potential and limitations of automated research assessment in non-English academic contexts and highlights the importance of developing culturally aware evaluation systems.

Cite This Study

Shehata et al. studied this question.

synapsesocial.com/papers/6a1d221f02fbce9130637f06 https://doi.org/10.1093/reseval/rvag026
