Abstract
This study investigates the capabilities of Large Language Models (LLMs) in evaluating Arabic academic research, focusing on the performance of GPT-4 and Claude across multiple assessment dimensions. In a pilot analysis of 60 strategically selected Arabic academic papers from the social sciences and humanities, chosen using institutional quality classifications, we examined the models’ ability to assess research quality using criteria adapted from the Research Excellence Framework. The study employed three input formats (full text, abstract, and title) and analyzed model performance across three key components: originality, significance, and rigor. Results demonstrate distinct patterns between the models. Claude achieved higher agreement rates in specific components (91.7% for rigor) despite poor overall correlations with human assessments, but exhibited an upward scoring bias (mean score 3.27 for full text). ChatGPT displayed more conservative scoring (mean score 2.54) with greater stability across iterations (SD 0.15–0.22). Both models performed better in evaluating methodological rigor (human-model correlation 0.27) than in assessing originality (correlation -0.03). Performance degraded significantly with reduced input length, particularly for title-only evaluations (MAD increasing from 0.49 to 1.09 for ChatGPT). The findings suggest that while current LLMs show promise in supporting Arabic academic evaluation, particularly in structured assessment components, they are better suited as supplementary tools than as standalone evaluation systems. This study contributes to understanding the potential and limitations of automated research assessment in non-English academic contexts and highlights the importance of developing culturally aware evaluation systems.
Shehata et al. studied this question.
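The abstract reports three kinds of human-model comparison: agreement rates, correlations, and MAD. The following is a minimal sketch of how such figures could be computed, not the authors' procedure. It assumes that MAD denotes the mean absolute deviation between model and human scores, that "agreement" means scores within a chosen tolerance of each other, and that the correlation is Pearson's r. The function names, the `tolerance` parameter, and the sample scores (on a 1 to 4 scale) are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def mean_absolute_deviation(human, model):
    """Average absolute gap between paired human and model scores (assumed meaning of MAD)."""
    return mean(abs(h - m) for h, m in zip(human, model))


def agreement_rate(human, model, tolerance=1):
    """Share of papers whose model score is within `tolerance` of the human score.

    The tolerance is an assumption; the abstract does not define agreement.
    """
    hits = sum(1 for h, m in zip(human, model) if abs(h - m) <= tolerance)
    return hits / len(human)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Made-up scores for six papers; these are not data from the study.
human_scores = [1, 2, 3, 4, 2, 3]
model_scores = [2, 3, 3, 4, 3, 4]

print("MAD:", mean_absolute_deviation(human_scores, model_scores))
print("Agreement (within 1):", agreement_rate(human_scores, model_scores))
print("Exact agreement:", agreement_rate(human_scores, model_scores, tolerance=0))
print("Pearson r:", pearson(human_scores, model_scores))
```

In this toy data the model scores equal to or one point above the human rater on every paper, which illustrates how a high within-tolerance agreement rate can coexist with an upward scoring bias and a nonzero MAD.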